Boston's public institutions collectively store an estimated tens of millions of digital image files across fragmented server systems — and a significant portion of those files are exact or near-exact duplicates, according to data management professionals working with city and university clients in the region. The problem is no longer abstract. Storage costs are measurable, retrieval times are slower, and the administrative burden falls on staff who have other jobs to do.
The timing matters. Mayor Michelle Wu's city government has pushed hard on digital modernization since 2022, and several departments are now mid-transition between legacy file systems and cloud-based infrastructure. That transition is exposing redundancy problems that older siloed storage quietly buried. When agencies at City Hall in Government Center begin migrating records to unified platforms, duplicate image files don't disappear. They multiply, because migration tools often copy rather than replace.
What the Data Actually Shows
Industry benchmarks from enterprise data management firms suggest that between 20 and 40 percent of files in unmanaged digital asset libraries are duplicates or near-duplicates. For a municipal government the size of Boston, which serves roughly 675,000 residents across 23 neighborhoods, that translates into a substantial storage footprint. Cloud storage pricing from major providers currently runs between $0.02 and $0.023 per gigabyte per month for standard-tier storage. At those rates, even a modest 50 terabytes of redundant files costs an institution roughly $12,000 to $13,800 annually in pure storage fees, before accounting for staff time spent managing the bloat.
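The arithmetic behind that storage figure can be checked with a short sketch. It assumes decimal units (1 terabyte = 1,000 gigabytes) and flat per-gigabyte pricing with no retrieval or egress fees; the function name is illustrative, not drawn from any vendor's tooling.

```python
# Back-of-the-envelope annual cost of storing redundant files.
# Assumptions: 1 TB = 1,000 GB, flat standard-tier pricing, no egress fees.

GB_PER_TB = 1_000
MONTHS_PER_YEAR = 12


def annual_storage_cost(redundant_tb: float, price_per_gb_month: float) -> float:
    """Return the yearly storage bill, in dollars, for redundant data."""
    return redundant_tb * GB_PER_TB * price_per_gb_month * MONTHS_PER_YEAR


low = annual_storage_cost(50, 0.020)   # low end of the quoted price range
high = annual_storage_cost(50, 0.023)  # high end of the quoted price range
print(f"${low:,.0f} to ${high:,.0f} per year")  # $12,000 to $13,800 per year
```

Swapping in an institution's own redundancy estimate and contracted rate gives a comparable baseline figure.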
Boston Public Library, which operates the Central Library on Boylston Street in Copley Square along with 25 neighborhood branches, has been digitizing historical photograph collections for years through its Digital Commonwealth program. Large-scale digitization projects like that one routinely generate duplicate derivatives (thumbnail versions, web-optimized copies, archival masters) that can triple or quadruple the raw file count without tripling the informational value. The same pattern plays out at Northeastern University's archives on Huntington Avenue and at Massachusetts General Hospital's medical imaging administrative systems in the West End.
The MBTA, currently under a federal safety management inspection program administered by the Federal Transit Administration, has its own version of this problem in engineering and inspection photography. Field inspectors photographing track conditions on the Green Line Extension or Red Line infrastructure near JFK/UMass Station upload images to department servers without systematic deduplication. Over a five-year period, that practice compounds into storage inefficiency that competes for the same IT budget dollars needed for operational improvements.
Cleaning It Up — and What It Actually Costs
Deduplication software has become a competitive market. Platforms marketed to mid-size institutions typically charge between $5,000 and $25,000 annually for enterprise licensing, depending on the volume of assets under management. Some open-source alternatives exist, but they require technical staff capacity that many Boston nonprofits and smaller city agencies simply do not have on hand.
The Boston Planning Department, which rebranded from the Boston Planning and Development Agency in 2024, maintains a public-facing GIS mapping portal that relies on regular aerial and street-level image ingestion. Geospatial image files are among the largest and most frequently duplicated asset types in municipal data systems, because the same block can be photographed under multiple project IDs and stored in separate departmental folders with no cross-reference.
For institutions looking to get ahead of the problem, data professionals working in the region generally recommend three steps: a full audit of existing storage to establish a baseline file count, deployment of hash-based deduplication tools that identify byte-for-byte identical files before tackling near-duplicates, and the establishment of a naming convention policy enforced at the point of upload rather than retroactively. Boston's Office of Digital Innovation, which sits within the city's Department of Innovation and Technology on City Hall Plaza, has the authority to set citywide data standards — making it the logical home for any coordinated policy response. Whether that office moves to formalize such standards before the next budget cycle begins in the fall of 2026 will determine how much of this problem gets cheaper to solve and how much gets more expensive to ignore.
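The second of those steps, hash-based detection of byte-for-byte identical files, can be sketched in a few lines using only the Python standard library. This is a minimal illustration, not a description of any tool the institutions named here use. The function names are hypothetical, SHA-256 is an assumed choice of hash, and grouping by file size first is an optimization added for the sketch.

```python
# Sketch: find byte-for-byte duplicate files under a directory tree.
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large images never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(root: Path) -> Dict[str, List[Path]]:
    """Map each content hash to the files sharing it, keeping only groups of 2+."""
    # Files of different sizes cannot be identical, so group by size first
    # and hash only the files that share a size with at least one other file.
    by_size: Dict[int, List[Path]] = defaultdict(list)
    for path in root.rglob("*"):
        if path.is_file():
            by_size[path.stat().st_size].append(path)

    by_hash: Dict[str, List[Path]] = defaultdict(list)
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue
        for path in same_size:
            by_hash[sha256_of(path)].append(path)

    return {h: sorted(paths) for h, paths in by_hash.items() if len(paths) > 1}
```

Calling find_exact_duplicates(Path("/some/archive")) returns groups of identical files for a person to review; the sketch deliberately deletes nothing. Near-duplicates such as resized or recompressed derivatives will not match, because any change to the bytes changes the hash.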