Boston's municipal digital infrastructure has a clutter problem. Across city departments, the Boston Public Library's digital collections, and the sprawling research databases of institutions along Longwood Avenue, thousands of duplicate image files have accumulated over roughly a decade of piecemeal digitization — a backlog that officials are now scrambling to address as storage costs climb and public-records requests get harder to fulfill.
The timing matters. Mayor Michelle Wu's administration has pushed a transparency and open-data agenda since taking office in November 2021, and the promise of accessible public records is hard to keep when archives are disorganized. Duplicate images — scanned permit photos, heritage photographs, inspection records, planning documents — clog retrieval systems and inflate cloud storage bills at a moment when every city budget line is under scrutiny heading into fiscal year 2027.
A Decade of Fast Digitization, Slow Organization
The roots of the problem stretch back to the early 2010s, when Boston launched successive digitization drives without a unified file-naming or deduplication standard. The City of Boston Archives on School Street began scanning historical records in earnest around 2013. The Boston Planning and Development Agency followed with its own document management push. Northeastern University's library system, working partly under a federal grant program, ran a parallel effort to digitize neighborhood photographs from Roxbury and the South End. None of these projects talked to each other in any systematic way.
When cloud migration became standard practice, roughly between 2018 and 2022, the redundant files came along for the ride. A single inspection photograph of a Jamaica Plain triple-decker might exist in four separate folders across two agencies, each copy with a slightly different filename and metadata tag. Multiply that across tens of thousands of records and the problem compounds fast.
The Boston Public Library's digital repository, accessible through its bpl.org portal, holds more than 1.6 million digitized items. Library staff have acknowledged in public budget hearings that deduplication was not built into the original workflow. The cost of cloud storage for redundant files is not trivial: industry benchmarks put enterprise cloud image storage at roughly $0.023 per gigabyte per month on standard tiers, and for a collection of that scale dominated by large archival master files, even modest duplication rates can translate to tens of thousands of dollars in unnecessary annual spend.
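To make the arithmetic concrete, here is a rough sketch of how the annual cost of duplicates scales. Only the item count (1.6 million) and the price ($0.023 per gigabyte per month) come from the figures above. The average file sizes and the 10 percent duplication rate are illustrative assumptions, not numbers reported by the library.

```python
def annual_duplicate_cost(items, avg_item_gb, duplicate_rate,
                          price_per_gb_month=0.023):
    """Estimate yearly spend on storing duplicate files.

    items              -- number of stored files (1.6 million, from the article)
    avg_item_gb        -- ASSUMED average file size in gigabytes
    duplicate_rate     -- ASSUMED share of stored bytes that are redundant
    price_per_gb_month -- standard-tier benchmark price cited in the article
    """
    duplicate_gb = items * avg_item_gb * duplicate_rate
    return duplicate_gb * price_per_gb_month * 12


# Compare a small-JPEG scenario (0.05 GB) with a large-master scenario (0.5 GB).
for avg_gb in (0.05, 0.5):
    cost = annual_duplicate_cost(1_600_000, avg_gb, 0.10)
    print(f"avg {avg_gb} GB per item: about ${cost:,.0f} per year")
```

Under these assumptions the estimate moves by a factor of ten with file size alone, which is why an audit would need to measure actual sizes and duplication rates before putting a dollar figure on the waste.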
What a Fix Actually Requires
Deduplication is not simply a matter of running a script. Hash-matching software can catch identical files, but near-duplicate images — the same photograph scanned at different resolutions, or with different color corrections applied — require more sophisticated perceptual hashing tools or manual review. The BPDA, which manages planning and zoning image records for neighborhoods including Dorchester and East Boston, has been piloting a metadata standardization effort since early 2025, but the work is slow and the staff hours required are significant.
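The difference between the two approaches can be sketched in a few lines. This is a minimal illustration, not any agency's actual tooling. The first half groups byte-identical files by SHA-256 digest. The second half shows the core idea of a difference hash, one common perceptual-hashing technique, on an already decoded grid of grayscale values. The function names are hypothetical, and a real pipeline would also need an image library to decode and downscale each scan before hashing.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(root):
    """Group files under root by content hash; return groups of 2 or more."""
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            groups[sha256_of(path)].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]


def dhash_bits(gray_rows):
    """Difference hash: one bit per pair of horizontally adjacent pixels.

    gray_rows is a list of rows of grayscale values, assumed to be
    already decoded and shrunk to a small fixed size.
    """
    return [
        1 if left > right else 0
        for row in gray_rows
        for left, right in zip(row, row[1:])
    ]


def hamming(bits_a, bits_b):
    """Count positions where two equal-length bit lists differ."""
    return sum(a != b for a, b in zip(bits_a, bits_b))


# The same tiny "image" before and after a uniform brightness correction.
original = [[10, 20, 30], [30, 20, 10]]
brighter = [[15, 25, 35], [35, 25, 15]]
print(hamming(dhash_bits(original), dhash_bits(brighter)))
```

Exact hashing ignores filenames and folders, so it catches copies that were merely renamed, but a single changed byte produces a different digest. The difference hash compares relative brightness between neighboring pixels, so a uniform correction leaves it unchanged. Pairs within a small Hamming distance are candidates for manual review rather than automatic deletion.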
The Massachusetts Executive Office of Technology Services and Security has provided guidance to state agencies on records management, but municipal systems like Boston's operate under their own procurement and IT governance structures. The city's Department of Innovation and Technology, headquartered at City Hall on Cambridge Street, is the coordinating body, though its capacity to run a citywide deduplication project depends on budget allocations that won't be finalized until late summer 2026.
For residents and researchers who rely on these records (genealogists pulling historic photographs from the BPL, lawyers requesting Inspectional Services images, community groups in Dorchester tracking neighborhood development history), the practical effect is slower response times and occasional gaps when the authoritative version of a file is unclear.
The clearest path forward involves three things: adopting a single city-wide file-naming protocol, running automated deduplication on existing archives before the next cloud contract renewal, and training staff at each agency to follow consistent upload procedures. Several peer cities, including Chicago and Denver, have published open-source playbooks for exactly this kind of records cleanup. Boston's IT department is expected to issue a formal request for proposals for a deduplication audit by the end of the third quarter of 2026. When that contract lands, it will signal whether the cleanup is finally moving from conversation to action.
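As an illustration of what a single file-naming protocol could look like in practice, the sketch below builds and validates names of the form agency_date_record_sequence.extension. The pattern, the agency codes, and the record identifiers are all hypothetical. Boston has not published such a standard, and a real protocol would be set by the city.

```python
import re
from datetime import date

# HYPOTHETICAL convention: agency code, ISO capture date, record id,
# three-digit sequence number, and a lowercase extension.
NAME_PATTERN = re.compile(
    r"^(?P<agency>[a-z]{2,6})_"
    r"(?P<date>\d{4}-\d{2}-\d{2})_"
    r"(?P<record>[a-z0-9]+)_"
    r"(?P<seq>\d{3})\.(?P<ext>tif|jpg|png)$"
)


def is_valid_name(name):
    """Check that a filename matches the pattern and holds a real date."""
    match = NAME_PATTERN.match(name)
    if match is None:
        return False
    try:
        date.fromisoformat(match["date"])
    except ValueError:
        return False
    return True


def build_name(agency, captured, record_id, seq, ext="tif"):
    """Compose a filename and reject inputs that break the convention."""
    name = (
        f"{agency.lower()}_{captured.isoformat()}_"
        f"{record_id.lower()}_{seq:03d}.{ext}"
    )
    if not is_valid_name(name):
        raise ValueError(f"does not fit the naming convention: {name!r}")
    return name


print(build_name("ISD", date(2019, 5, 4), "JP10234", 2))
print(is_valid_name("IMG_0042 (copy).jpg"))
```

A validator like this could run at upload time, so that inconsistent names are rejected before they reach shared storage instead of being cleaned up afterward.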