Boston's municipal archives hold more than a century of photographs documenting everything from the Big Dig's early excavations along Atlantic Avenue to ribbon-cuttings at Dorchester's Codman Square Health Center. The problem: nobody knows exactly how many of those images exist in duplicate, triplicate, or worse. A review begun in late 2025 by the Boston City Archives found that at least 40 percent of digitized image files across city departments share identical or near-identical content under different file names, a bureaucratic tangle years in the making.
The timing matters. Mayor Michelle Wu's administration has pushed hard on open-data initiatives since 2022, including the expanded Boston Data Portal that gives residents and researchers access to city records. Redundant image files jam that system, slow search functions, and — critically — make it harder to verify which version of a photograph is the authoritative one when records are requested under the state's Public Records Law. The cleanup effort, which archivists are calling a duplicate image replacement project, is the unglamorous back-end work that open-government promises require.
How the Mess Got Made
The roots go back to roughly 2013, when individual city departments began scanning their own photo collections independently. The Boston Planning Department (then still operating as the Boston Redevelopment Authority) ran its own digitization drive. The Parks and Recreation Department did the same for Jamaica Plain's Franklin Park and the Arnold Arboretum corridor. The Public Works Department scanned construction documentation from projects along Melnea Cass Boulevard. None of these efforts used a common file-naming protocol or a shared metadata standard. Files landed on separate departmental servers, duplicates multiplied every time an image was emailed between offices, and no single registry tracked what existed where.
By 2019, the city's IT department had consolidated some of those servers onto a shared network, but the consolidation itself generated a new round of duplicates as files were copied rather than moved. The Boston City Archives, based at City Hall on Cambridge Street in Government Center, flagged the problem internally that year. It took until the fiscal year 2025 budget cycle, which allocated $380,000 for records digitization and management improvements, for the project to receive dedicated staffing and software licensing.
The software now being piloted uses perceptual hashing, a technique that fingerprints an image by its visual content rather than by its file name or exact bytes, so visually identical or near-identical images can be matched even when file names differ. Early runs across the Parks Department's collection alone turned up more than 8,200 duplicate pairs. Archivists are working through collections department by department, beginning with Public Works and Planning, before moving to smaller agencies.
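For readers curious how perceptual hashing works in practice, the sketch below shows one simple variant, known as average hashing, in Python. It is an illustration only: the city has not said which algorithm its software uses, and the function names, the 8-by-8 grid, and the distance threshold of 5 are assumptions chosen for the example. Images are represented as plain grids of grayscale brightness values, with each side at least 8 pixels long, so the code runs without any imaging library.

```python
from typing import List, Sequence

# A grayscale image as rows of brightness values (assumed representation).
Grid = Sequence[Sequence[float]]


def shrink(pixels: Grid, size: int = 8) -> List[List[float]]:
    """Downscale to size x size by averaging rectangular blocks of pixels.

    Assumes the image is at least `size` pixels in each dimension.
    """
    height, width = len(pixels), len(pixels[0])
    small = []
    for r in range(size):
        r0, r1 = r * height // size, (r + 1) * height // size
        row = []
        for c in range(size):
            c0, c1 = c * width // size, (c + 1) * width // size
            block = [pixels[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            row.append(sum(block) / len(block))
        small.append(row)
    return small


def average_hash(pixels: Grid, size: int = 8) -> int:
    """Return a size*size-bit fingerprint: 1 where a cell is brighter than the mean."""
    cells = [value for row in shrink(pixels, size) for value in row]
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: Grid, b: Grid, max_distance: int = 5) -> bool:
    """Treat two images as duplicates if their fingerprints differ in few bits.

    The threshold of 5 is an illustrative assumption, not a published setting.
    """
    return hamming(average_hash(a), average_hash(b)) <= max_distance
```

Because the fingerprint depends only on which regions are brighter than the image's own average, a copy that has been renamed, resized, or uniformly brightened tends to produce the same or nearly the same bits, while a genuinely different picture does not.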
What Replacement Actually Means
The phrase "duplicate image replacement" is a bit misleading to outsiders. Archivists are not deleting files outright. Instead, each identified duplicate cluster gets a single canonical record assigned a standardized identifier, with the redundant copies flagged and eventually migrated to cold storage rather than the active search index. The process matters for the Boston Data Portal because portal users searching for, say, construction photographs from the Fairmount Corridor rail project in Dorchester currently see the same image returned several times in one set of results, each copy under a different file name. That is a user-experience failure that erodes trust in the system.
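The bookkeeping described here can be sketched in a few lines of Python. This is a hypothetical illustration, not the archives' actual workflow: the article does not say how the canonical copy is chosen or what the identifier looks like, so the rule used below (the alphabetically first path becomes canonical) and the "IMG-000001" identifier format are assumptions made for the example. The sketch groups duplicate pairs into clusters, so that if A matches B and B matches C, all three end up in one record.

```python
from typing import Dict, Iterable, List, Tuple


def cluster_duplicates(pairs: Iterable[Tuple[str, str]]) -> List[List[str]]:
    """Merge duplicate pairs into clusters using a simple union-find."""
    parent: Dict[str, str] = {}

    def find(name: str) -> str:
        parent.setdefault(name, name)
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # shorten the chain as we go
            name = parent[name]
        return name

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters: Dict[str, List[str]] = {}
    for name in list(parent):
        clusters.setdefault(find(name), []).append(name)
    # Sort members and clusters so the output is stable from run to run.
    return sorted(sorted(members) for members in clusters.values())


def build_records(clusters: List[List[str]], prefix: str = "IMG") -> List[dict]:
    """Create one canonical record per cluster and flag the rest for cold storage.

    Assumed rule: the alphabetically first path is treated as canonical.
    Assumed identifier format: prefix plus a zero-padded sequence number.
    """
    records = []
    for number, members in enumerate(clusters, start=1):
        canonical, *redundant = members
        records.append({
            "id": f"{prefix}-{number:06d}",
            "canonical": canonical,
            "cold_storage": redundant,  # flagged for migration, not deleted
        })
    return records
```

Nothing is deleted in this sketch, which mirrors the archivists' approach: redundant copies are only listed for migration, and the canonical record is what a search index would return.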
The Massachusetts Secretary of State's office, which sets public records standards for municipalities, requires that electronic records be maintained in a way that ensures authenticity and retrievability. Duplicate proliferation creates a compliance exposure: if two versions of an official photograph differ even slightly — a crop, a color adjustment — it becomes unclear which is the record of authority in a legal context.
Archivists expect to complete the first phase, covering the seven largest city departments, by the end of calendar year 2026. A second phase, targeting smaller boards and commissions, is tentatively scheduled for 2027. For residents who regularly file public records requests — housing advocates tracking development in Jamaica Plain, journalists researching infrastructure decisions — the practical benefit will be faster response times and cleaner results. The city's archives office has asked departments to begin tagging new photographs at the point of creation using a metadata template issued in March 2026. Whether that discipline holds once the immediate project pressure eases is the real test of whether this fix lasts.