Boston's municipal digital archive system is carrying tens of thousands of duplicate image files — the residue of more than a decade of piecemeal digitization projects that were never coordinated under a single standard. The problem, long acknowledged inside City Hall on Cambridge Street but rarely aired publicly, has reached a point where several departments are now actively working to clean up the backlog before a planned migration to a unified content management platform later this year.
The timing matters. Mayor Michelle Wu's administration has pushed hard on open-government and civic-tech commitments, and a cluttered, redundant image archive undercuts the usability of the public-facing data portals the city has been promoting since 2022. When residents search property records, permitting documents, or planning maps through the Analyze Boston portal, duplicate entries slow retrieval and sometimes surface conflicting versions of the same document.
How the Duplication Accumulated
The roots go back to the early 2010s, when individual city agencies — the Boston Planning & Development Agency, the Inspectional Services Department, and the Office of Arts and Culture among them — each contracted separately for document scanning and digitization. There was no citywide metadata standard. A photograph of a Dorchester triple-decker submitted as part of a zoning variance could end up stored three or four times: once in the ISD's permit database, once in the BPDA's parcel files, and again in a legacy system that was never fully decommissioned after a 2017 infrastructure upgrade.
The problem compounded during the COVID-19 pandemic. Between March 2020 and mid-2021, staff working remotely uploaded scanned documents through multiple channels simultaneously — email attachments, shared drives, and direct portal uploads — because no one had written a clear protocol for remote document submission. Jamaica Plain's neighbourhood planning files, for instance, reportedly contain overlapping image sets from at least three separate upload events tied to the same 2020 rezoning review. The BPDA has since acknowledged the redundancy in that file set in general terms without providing a specific count.
A 2023 report from the city's Department of Innovation and Technology (DoIT), a public document available through the Analyze Boston open data portal, identified the management of duplicate assets as one of the top five operational drains on the city's IT support workload. The report did not assign a dollar figure to the problem, but it flagged it as a priority for the next budget cycle. The Wu administration's fiscal year 2025 budget allocated $2.1 million to DoIT for infrastructure modernisation, a portion of which is directed at the archive consolidation project.
The Path to a Fix
The consolidation effort is centred on adopting a digital asset management system that uses perceptual hashing — a technique that identifies visually identical or near-identical images regardless of filename or file size — to flag duplicates before they enter the archive. DoIT has been piloting the approach with the Boston Public Library's Copley Square digitization team, which maintains its own image repository of historical Boston photographs and has dealt with similar redundancy issues in its Digital Commonwealth collections.
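To make the idea concrete, here is a minimal sketch of one common perceptual hash, the average hash. It is an illustration only: the city has not said which algorithm or product it is piloting, and the function names, the 8x8 grid, and the threshold of 5 differing bits are all assumptions chosen for this example. Images are represented as plain grids of grayscale values so the sketch runs without an imaging library.

```python
from typing import List

Grid = List[List[int]]  # grayscale pixels (0-255), stored row by row


def shrink(pixels: Grid, size: int = 8) -> List[List[float]]:
    """Downsample to size x size by averaging rectangular blocks of pixels."""
    height, width = len(pixels), len(pixels[0])
    if height < size or width < size:
        raise ValueError("image is smaller than the hash grid")
    small = []
    for r in range(size):
        r0, r1 = r * height // size, (r + 1) * height // size
        row = []
        for c in range(size):
            c0, c1 = c * width // size, (c + 1) * width // size
            block = [pixels[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            row.append(sum(block) / len(block))
        small.append(row)
    return small


def average_hash(pixels: Grid, size: int = 8) -> int:
    """Return a size*size-bit hash: one bit per cell, set if the cell is brighter than the mean."""
    flat = [v for row in shrink(pixels, size) for v in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for v in flat:
        bits = (bits << 1) | (1 if v > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Count the bit positions where two hashes differ."""
    return bin(a ^ b).count("1")


def is_probable_duplicate(a: Grid, b: Grid, threshold: int = 5) -> bool:
    """Flag two images as likely duplicates if their hashes differ in few bits.

    The threshold is an illustrative assumption, not a recommended setting.
    """
    return hamming(average_hash(a), average_hash(b)) <= threshold
```

Because the hash is computed from a shrunken version of the picture rather than from the file's bytes, the same scan saved at two resolutions or under two filenames produces matching or nearly matching hashes, while unrelated images differ in many bits. A production system would likely use a more robust variant and tune the threshold against real records.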
The practical rollout is scheduled in phases. The first phase, targeting ISD permit records for properties in Dorchester and East Boston, is expected to wrap before the end of calendar year 2026. Those two neighbourhoods were prioritised partly because they have the highest volume of active permitting in the city and partly because their records are most frequently requested under public records law.
For residents or researchers who regularly pull documents from Analyze Boston or the city's GIS open data hub, the near-term advice from DoIT is straightforward: if you download image files now and plan to use them for research or reporting, note the file's unique record ID rather than relying on the filename, since filenames are among the fields most likely to change during the deduplication sweep. The city has committed to publishing a changelog when the migration goes live, so users can match old identifiers to new ones without losing their work.
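For researchers who keep their own citation logs, the matching step might look something like the sketch below. The city has not published the changelog or its format, so the column names `old_record_id` and `new_record_id`, the sample identifiers, and the function names are all hypothetical, used here only to illustrate why keying notes to the record ID makes a later update mechanical.

```python
import csv
import io
from typing import Dict, List


def load_changelog(csv_text: str) -> Dict[str, str]:
    """Read a changelog with hypothetical columns old_record_id,new_record_id."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["old_record_id"]: row["new_record_id"] for row in reader}


def update_citations(citations: List[dict], mapping: Dict[str, str]) -> List[dict]:
    """Return copies of the citations with record IDs remapped where the changelog lists them."""
    updated = []
    for citation in citations:
        entry = dict(citation)  # copy so the original log is left untouched
        new_id = mapping.get(citation["record_id"])
        if new_id is not None:
            entry["previous_record_id"] = citation["record_id"]
            entry["record_id"] = new_id
        updated.append(entry)
    return updated


# Made-up sample data for illustration only.
sample_changelog = "old_record_id,new_record_id\nA-100,B-1\nA-101,B-1\n"
my_citations = [
    {"record_id": "A-100", "note": "front elevation photo"},
    {"record_id": "A-999", "note": "site plan"},
]
remapped = update_citations(my_citations, load_changelog(sample_changelog))
```

In this sample, two old identifiers collapse into one new record, which is what a deduplication sweep would be expected to produce; entries the changelog does not mention are left unchanged.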