Boston's municipal digital archive holds tens of thousands of photographs, maps, and scanned documents — and a significant portion of them are duplicates. The problem didn't happen overnight. It accumulated across roughly 25 years of piecemeal digitization projects run independently by agencies that rarely talked to one another, leaving the city with a fractured, redundant visual record that archivists and researchers are now working to untangle.
The issue matters now because the Wu administration has prioritized open-data infrastructure as part of a broader transparency push. The Boston Digital Equity and Data Governance Initiative, launched in 2024, set a goal of making city records more accessible to residents in neighborhoods like Dorchester and Jamaica Plain, where community land trusts and housing advocates rely on historical imagery to document displacement and neighborhood change. Duplicate files slow search tools, inflate storage costs, and make it harder to surface the images that actually matter.
How the Redundancy Built Up
The roots of the problem trace back to the late 1990s, when individual departments — the Boston Planning & Development Agency on Atlantic Avenue, the Boston Public Library's Rare Books & Manuscripts department on Boylston Street, and the City Clerk's office on City Hall Plaza — each began scanning materials using different software, different naming conventions, and different metadata standards. Nobody coordinated. A single historic photograph of Dudley Square might have been scanned by three separate departments, saved under three different filenames, and uploaded to three different servers.
The Boston Public Library's Digital Repository alone catalogues more than 200,000 items. Archivists there have estimated internally that duplicate or near-duplicate entries account for a meaningful share of the collection, though a full audit figure has not been publicly released. The problem is compounded by the fact that early scanning runs often produced both a high-resolution master file and a compressed web-ready version, with both versions sometimes entered as separate catalog records rather than linked representations of the same item.
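To make the distinction concrete, the sketch below shows one way a catalog could hold a master scan and its web-ready copy as linked representations of a single item rather than as separate records. It is illustrative only: the `ScanFile` and `CatalogRecord` types, the `item_key` field, the sample filenames, and the rule of treating the largest file as the master are all assumptions, not a description of the BPL's actual schema.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ScanFile:
    filename: str
    item_key: str  # hypothetical identifier shared by every scan of one physical item
    width: int
    height: int


@dataclass
class CatalogRecord:
    item_key: str
    master: ScanFile
    derivatives: List[ScanFile] = field(default_factory=list)


def link_representations(files: List[ScanFile]) -> Dict[str, CatalogRecord]:
    """Group scans by item and keep one record per item.

    Assumption: the scan with the most pixels is treated as the master;
    every other scan of the same item becomes a linked derivative.
    """
    grouped: Dict[str, List[ScanFile]] = defaultdict(list)
    for f in files:
        grouped[f.item_key].append(f)

    records: Dict[str, CatalogRecord] = {}
    for key, group in grouped.items():
        ordered = sorted(group, key=lambda f: f.width * f.height, reverse=True)
        records[key] = CatalogRecord(key, ordered[0], ordered[1:])
    return records


# Made-up sample data: two scans of one photograph, one scan of a map.
files = [
    ScanFile("square_1910_web.jpg", "photo-0001", 1024, 768),
    ScanFile("square_1910_master.tif", "photo-0001", 6000, 4500),
    ScanFile("harbor_map.tif", "map-0002", 8000, 6000),
]
records = link_representations(files)
for rec in records.values():
    print(rec.item_key, rec.master.filename, [d.filename for d in rec.derivatives])
```

Under this arrangement, a search would return one entry per item, with the compressed copy reachable from the master record instead of competing with it in the results.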
Northeastern University's library system, which shares digitization infrastructure with several city programs through a partnership formalized in 2021, identified the cross-institutional duplication issue in a 2023 internal review. Staff there flagged that images from the Boston City Archives were appearing in multiple partner repositories without cross-reference links, effectively orphaning context and provenance data.
The Cleanup Effort Taking Shape
The practical work of deduplication — identifying redundant files, selecting canonical versions, and retiring or merging the rest — is now being led by the Boston Archives and Records Management division, which operates out of a facility in West Roxbury. The division received a supplemental allocation in the fiscal year 2026 city budget to hire two additional digital records specialists, positions that were posted in March 2026.
The technical approach relies on perceptual hashing, a method that reduces each image to a compact fingerprint of its visual content and compares those fingerprints rather than raw file bytes or individual pixels. Because visually similar images produce similar fingerprints, archivists can catch duplicates even when files have been resized, recompressed, or had their metadata stripped. Staff are also working through a backlog dating to a 2018 server migration that generated thousands of orphaned image files when the city moved storage vendors.
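As a rough illustration of how perceptual hashing works, the sketch below implements an average hash, one of the simplest perceptual hashes, in plain Python on grayscale pixel grids. The article does not say which algorithm or software the division uses, so the function names, the 8x8 fingerprint size, and the distance threshold of 5 are assumptions chosen for the example.

```python
from typing import List

Gray = List[List[int]]  # rows of 0-255 grayscale pixel values


def shrink(img: Gray, size: int = 8) -> List[List[float]]:
    """Downsample to size x size by averaging the pixels in each grid cell."""
    h, w = len(img), len(img[0])
    out = []
    for r in range(size):
        r0 = r * h // size
        r1 = max((r + 1) * h // size, r0 + 1)
        row = []
        for c in range(size):
            c0 = c * w // size
            c1 = max((c + 1) * w // size, c0 + 1)
            cell = [img[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            row.append(sum(cell) / len(cell))
        out.append(row)
    return out


def average_hash(img: Gray, size: int = 8) -> int:
    """Return a size*size-bit fingerprint: 1 where a cell is brighter than the mean."""
    flat = [v for row in shrink(img, size) for v in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for v in flat:
        bits = (bits << 1) | (1 if v > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def likely_duplicates(a: Gray, b: Gray, threshold: int = 5) -> bool:
    # The threshold is an assumed tuning value, not a standard.
    return hamming(average_hash(a), average_hash(b)) <= threshold


# Synthetic test images: a diagonal gradient, a half-size copy, and its negative.
original = [[(x + y) * 2 for x in range(64)] for y in range(64)]
half_size = [[original[2 * y][2 * x] for x in range(32)] for y in range(32)]
negative = [[255 - v for v in row] for row in original]

print(likely_duplicates(original, half_size))
print(likely_duplicates(original, negative))
```

A byte-level checksum would treat the resized copy as an entirely different file, while the fingerprint comparison treats it as a match. In practice, an archive would more likely rely on an established imaging library and a more robust hash than this one, and would tune the threshold against a sample of known duplicates.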
For residents and researchers who use the Boston City Archives portal or the BPL's Digital Commonwealth platform, the practical changes should eventually be visible in cleaner search results and faster load times. Community groups in Jamaica Plain that use historical aerial photographs to track green space loss and infill development have long complained about search results cluttered with redundant entries. The deduplication project is expected to run through at least the end of calendar year 2026, with a phased rollout of cleaned collections beginning as early as September. Researchers who need specific collections in the meantime can submit priority requests directly to the Archives and Records Management division at its West Roxbury location.