Boston's public agencies and major institutions are sitting on millions of duplicate image files: redundant photographs, scanned documents, and permit attachments that clog servers, inflate storage contracts, and cost taxpayers money that does not need to be spent. A growing push to quantify and clean up that digital clutter is revealing just how expensive the problem has become.
The issue is not unique to Boston, but the city's particular mix of large public institutions — the MBTA, Boston City Hall's Inspectional Services Department, the Boston Public Library's digital archive on Boylston Street — makes the numbers here worth examining closely. Each agency maintains its own image repository, often with little coordination, and the result is predictable: the same file, stored four or five times across different servers, drawing down the same budget line year after year.
The Numbers That Drive the Problem
Industry benchmarks from data management researchers suggest that duplicate files can account for between 20 and 30 percent of total storage consumption in large public-sector environments. For a mid-size city agency running a 50-terabyte archive — a reasonable estimate for a department like Inspectional Services, which photographs every permitted construction site in neighborhoods from Dorchester to East Boston — that translates to 10 to 15 terabytes of redundant data sitting on drives that cost real money to maintain. Enterprise cold storage typically runs between $15 and $25 per terabyte per month on major cloud platforms, meaning even a modest duplicate problem can add $1,800 to $4,500 annually to a single department's bill before anyone notices.
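The arithmetic behind that range is easy to check. The short Python sketch below reproduces it from the figures quoted above; the function name and parameters are illustrative, and the 50-terabyte archive is the article's estimate rather than a reported figure.

```python
def annual_duplicate_cost(archive_tb, duplicate_percent, usd_per_tb_month):
    """Yearly cost of storing the duplicated share of an archive."""
    duplicate_tb = archive_tb * duplicate_percent / 100
    return duplicate_tb * usd_per_tb_month * 12

# Low end: 20 percent duplicates at $15 per terabyte per month.
low = annual_duplicate_cost(50, 20, 15)
# High end: 30 percent duplicates at $25 per terabyte per month.
high = annual_duplicate_cost(50, 30, 25)

print(f"${low:,.0f} to ${high:,.0f} per year")  # $1,800 to $4,500 per year
```

The same function can be rerun with an agency's actual archive size and contract rate once those are known.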
The MBTA, which has undergone significant internal reform since the Fiscal and Management Control Board era and continues to modernize under pressure from the state legislature, uses image capture extensively — from track inspection cameras on the Red Line to bus fleet maintenance photography stored at the Southampton Street garage in Roxbury. Transit agencies of comparable size nationally have reported recovering 18 to 22 percent of their active storage capacity after running automated deduplication passes, according to published case studies from the American Public Transportation Association.
At the Boston Public Library's digital collections unit, the scale is different but the logic is the same. The BPL has been digitizing historical photographs and maps since the early 2000s, and the Boylston Street central branch hosts servers that include scans of Boston neighborhoods stretching back to the late 19th century. Archivists working with collections of that age routinely encounter the same image scanned at different resolutions, saved under different filenames, and stored in separate folders by different staff members over two decades. That kind of organic duplication slips past simple hash matching, because two scans of the same photograph at different resolutions are not byte-for-byte identical. Catching it requires image-aware tools and more manual review time, which, at prevailing library specialist wages in Suffolk County, is not cheap.
What Deduplication Actually Looks Like in Practice
The technical solution is well understood. Deduplication software uses cryptographic hashing, which computes a fingerprint from each file's exact contents that is unique for all practical purposes, to identify files that are byte-for-byte identical. More sophisticated tools use perceptual hashing, which fingerprints what an image looks like rather than its raw bytes and can flag images that are visually near-identical even if they were saved at different resolutions or with slightly different metadata. Several city governments, including those in Chicago and New York, have implemented deduplication as part of broader digital asset management overhauls in the past three years.
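The difference between the two approaches can be shown in a few lines of Python. This is a minimal sketch, not any vendor's implementation: the "images" are small grids of grayscale values rather than real files, the helpers `average_hash` and `upscale` are written for this illustration, and the simplified average hash assumes image dimensions that divide evenly by the hash size. Production tools use more robust perceptual hashes.

```python
import hashlib

def sha256_fingerprint(data: bytes) -> str:
    """Cryptographic fingerprint: changes if even one byte changes."""
    return hashlib.sha256(data).hexdigest()

def average_hash(pixels, size=4):
    """Simplified perceptual hash of a grayscale grid (values 0-255).

    Shrinks the image to size x size by averaging blocks, then records
    whether each cell is brighter than the overall mean.
    """
    rows, cols = len(pixels), len(pixels[0])
    bh, bw = rows // size, cols // size
    cells = []
    for r in range(size):
        for c in range(size):
            block = [pixels[y][x]
                     for y in range(r * bh, (r + 1) * bh)
                     for x in range(c * bw, (c + 1) * bw)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    return tuple(1 if v > mean else 0 for v in cells)

def upscale(pixels, factor):
    """Repeat every pixel to mimic the same picture at a higher resolution."""
    return [[v for v in row for _ in range(factor)]
            for row in pixels for _ in range(factor)]

def to_bytes(pixels):
    return bytes(v for row in pixels for v in row)

# A 4x4 checkerboard and the same picture "rescanned" at 8x8.
small = [[10, 200, 10, 200],
         [200, 10, 200, 10],
         [10, 200, 10, 200],
         [200, 10, 200, 10]]
large = upscale(small, 2)

same_bytes = sha256_fingerprint(to_bytes(small)) == sha256_fingerprint(to_bytes(large))
same_look = average_hash(small) == average_hash(large)
print("exact match:", same_bytes)
print("perceptual match:", same_look)
```

In this toy case the exact-match check treats the two versions as different files, while the perceptual check treats them as the same picture, which is the gap archivists describe.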
Mayor Michelle Wu's administration has emphasized technology modernization as part of its broader city operations agenda, and the Office of New Urban Mechanics, based at City Hall on Cambridge Street, has historically served as the incubator for projects that touch city data infrastructure. Whether a formal deduplication initiative is in active planning is not publicly confirmed, but the fiscal pressure is real: Boston's FY2026 budget allocated funds for digital services improvements, and storage overhead is a line item that budget analysts scrutinize.
For departments looking to act now, the practical first step is an inventory. Before any files are deleted, a complete audit of what exists, where it lives, and who last accessed it is essential. Jamaica Plain-based nonprofits that manage public housing records, for example, often use the same image management software as city agencies and face the same problem at smaller scale. Starting with a 90-day audit cycle, running deduplication tools on a sandboxed copy of the archive first, and flagging rather than auto-deleting matches are the steps data managers consistently recommend before any files are permanently removed.
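A flag-first pass of the kind data managers describe might look like the following Python sketch. It is an illustration under stated assumptions, not a city tool: the function names, the report fields, and the sample filenames are hypothetical, and it catches only byte-identical copies. It runs against a temporary sandbox folder and never deletes anything.

```python
import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path

def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large scans fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def flag_duplicates(root):
    """Group byte-identical files under root. Nothing is deleted."""
    by_digest = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            by_digest[file_digest(path)].append(path)
    report = []
    for digest, paths in by_digest.items():
        if len(paths) > 1:
            report.append({
                "sha256": digest,
                "keep_candidate": paths[0],   # a human confirms this choice
                "flagged": paths[1:],         # marked for review only
                "reclaimable_bytes": sum(p.stat().st_size for p in paths[1:]),
            })
    return report

# Demo on a sandboxed copy: hypothetical filenames, placeholder contents.
with tempfile.TemporaryDirectory() as sandbox:
    root = Path(sandbox)
    (root / "permits").mkdir()
    (root / "permits" / "site_001.jpg").write_bytes(b"fake image bytes")
    (root / "site_001_copy.jpg").write_bytes(b"fake image bytes")
    (root / "site_002.jpg").write_bytes(b"different bytes")

    report = flag_duplicates(root)
    remaining = sorted(p.name for p in root.rglob("*") if p.is_file())

for group in report:
    print(len(group["flagged"]), "file(s) flagged,",
          group["reclaimable_bytes"], "bytes reclaimable")
```

Because the script only produces a report, staff can review each flagged group before anything is removed, and the original archive is never touched during the trial run.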