Boston's digital archives have a redundancy problem. City agencies, university libraries, and biotech research institutions across the metro area are carrying thousands of duplicate image files in their public-facing and internal databases — a technical shortcoming that wastes storage, distorts search results, and in some cases buries authentic historical records under layers of repeated scans. This week, the issue moved from the backlog into active conversation at City Hall and along the Longwood Medical Area corridor.
The timing matters. Mayor Michelle Wu's administration has been pushing a broader open-data and civic transparency agenda since taking office, and the city's IT department has been under pressure to consolidate records ahead of a planned migration to a unified municipal content management platform expected to launch in early 2027. Duplicate imagery — ranging from repeated photographs in the Boston City Archives on School Street to redundant property assessment scans in the Assessing Department's online portal — represents one of the messier line items in that transition.
What Happened This Week
On Wednesday, July 2, staff at the Boston Public Library's Digital Repository Service, based at the central branch on Boylston Street in Copley Square, circulated an internal working document flagging the scale of the duplication issue within their own holdings. The BPL's Flickr-linked collections and the Digital Commonwealth platform — which aggregates digitised materials from libraries and museums across Massachusetts — both carry recurring versions of the same late-19th and early-20th century neighbourhood photographs, particularly images from Dorchester, Jamaica Plain, and the South End.
Separately, Northeastern University's Snell Library, on the main campus in Fenway, reported this week that an audit of its Archives and Special Collections had identified a significant volume of redundant image ingests tied to a 2023 batch upload from a partnering institution. The library's digital services team is now running deduplication scripts against approximately 14,000 image files. The audit was conducted in June; results were shared with staff this week.
On the biotech side, several research data managers working in the Longwood Medical Area flagged that imaging repositories used in clinical trial documentation, which are governed by federal record-keeping rules under 21 CFR Part 11, have been generating duplicate entries through automated pipeline errors. Correcting those errors before a records audit is not optional: leaving them in place carries regulatory consequences.
Why the Fix Is Harder Than It Sounds
Deduplication sounds straightforward. It is not. Standard duplicate-detection tools compare file hashes, fingerprints computed from a file's exact bytes, so only byte-for-byte identical copies match. Photographs scanned at different resolutions, or images that have been lightly cropped or colour-corrected, will carry different hashes even though they depict the same subject. That means the BPL's records team, for example, cannot simply run a hash-matching script and call the job done. Human review is required for a meaningful portion of the flagged files.
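To make the limitation concrete, here is a minimal sketch of exact-match detection in Python. It assumes SHA-256 as the hash function (none of the institutions has said which algorithm it uses), and the file names and the group_exact_duplicates helper are hypothetical. It is not the script any of the libraries is actually running.

```python
import hashlib
from collections import defaultdict


def sha256_digest(data: bytes) -> str:
    """Return the SHA-256 fingerprint of a file's raw bytes."""
    return hashlib.sha256(data).hexdigest()


def group_exact_duplicates(files: dict[str, bytes]) -> list[list[str]]:
    """Group file names whose contents are byte-for-byte identical."""
    by_digest = defaultdict(list)
    for name, data in files.items():
        by_digest[sha256_digest(data)].append(name)
    # Only digests shared by two or more files indicate duplicates.
    return [sorted(names) for names in by_digest.values() if len(names) > 1]


# Hypothetical stand-ins for scanned files: two identical copies and one
# copy that differs by a single byte (as a re-scan or colour edit would).
files = {
    "dorchester_001.tif": b"scan-bytes-original",
    "dorchester_001_copy.tif": b"scan-bytes-original",
    "dorchester_001_rescan.tif": b"scan-bytes-original!",
}
duplicates = group_exact_duplicates(files)
```

In this sketch only the two identical copies are grouped. The re-scan differs by one byte, gets an unrelated digest, and slips through, which is why hash matching alone cannot finish the job.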
The Digital Commonwealth platform, maintained by the Boston Public Library with support from the Massachusetts Board of Library Commissioners, indexes material from more than 180 contributing institutions statewide. Even a modest duplication rate across that network compounds quickly. The platform currently hosts more than 1.6 million digitised objects, according to figures published on the Digital Commonwealth website.
Storage costs are not trivial. Cloud storage pricing for institutional archives typically runs between $0.02 and $0.05 per gigabyte per month depending on the provider and access tier. High-resolution archival image files commonly run 50 to 100 megabytes each. Multiply redundant copies across a large collection and the monthly bill adds up — money that library and city budgets, already under pressure in fiscal year 2026, could redirect elsewhere.
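The arithmetic can be sketched in a few lines of Python. The price and file-size ranges come from the figures above; the count of 100,000 redundant copies is a purely hypothetical input chosen for illustration, not a number reported by any institution, and the calculation assumes decimal units (1 GB = 1,000 MB).

```python
def monthly_storage_cost(copies: int, mb_per_file: float, usd_per_gb_month: float) -> float:
    """Monthly cost in USD of storing `copies` files, with 1 GB = 1,000 MB."""
    gigabytes = copies * mb_per_file / 1000
    return gigabytes * usd_per_gb_month


REDUNDANT_COPIES = 100_000  # hypothetical count, for illustration only

# Low end: 50 MB files at $0.02/GB; high end: 100 MB files at $0.05/GB.
low = monthly_storage_cost(REDUNDANT_COPIES, 50, 0.02)
high = monthly_storage_cost(REDUNDANT_COPIES, 100, 0.05)
print(f"${low:,.2f} to ${high:,.2f} per month")
```

Under those assumptions the redundant copies would cost roughly $100 to $500 a month; an institution can substitute its own audited count to get its own figure.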
The practical path forward, according to digital preservation literature and the approaches being discussed at Snell Library and the BPL this week, involves a phased process: automated hash-matching first to catch obvious exact duplicates, followed by perceptual hashing tools that can identify near-duplicates, and finally human curatorial review for anything the algorithms flag as uncertain. For the BPL's Dorchester and Jamaica Plain photograph collections specifically, the library has indicated it intends to have a preliminary deduplication report ready before the end of August. City agencies feeding into the 2027 platform migration have been given a soft deadline of December 31, 2026, to submit clean, audited image inventories.
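The three phases can be illustrated with a short Python sketch. It assumes a simple "average hash" as the perceptual hash (one of several such techniques, and not necessarily the one either library will choose), operates on grayscale pixel grids at least 8 by 8 in size, and uses hypothetical thresholds of 4 and 12 differing bits. The Scan type and the triage and scan_from_pixels helpers are invented for this example.

```python
import hashlib
from typing import NamedTuple

Pixels = list[list[int]]  # grayscale rows, values 0-255


class Scan(NamedTuple):
    data: bytes     # raw file bytes, used for the exact-match pass
    pixels: Pixels  # decoded grayscale pixels, used for the perceptual pass


def average_hash(pixels: Pixels, size: int = 8) -> int:
    """Shrink to size x size by block averaging; one bit per cell above the mean."""
    height, width = len(pixels), len(pixels[0])
    cells = []
    for r in range(size):
        for c in range(size):
            rows = range(r * height // size, (r + 1) * height // size)
            cols = range(c * width // size, (c + 1) * width // size)
            block = [pixels[y][x] for y in rows for x in cols]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | int(value > mean)
    return bits


def hamming(a: int, b: int) -> int:
    """Count the bits that differ between two hashes."""
    return bin(a ^ b).count("1")


def triage(a: Scan, b: Scan, near: int = 4, review: int = 12) -> str:
    """Phase 1: exact hash. Phase 2: perceptual hash. Phase 3: flag for a human."""
    if hashlib.sha256(a.data).digest() == hashlib.sha256(b.data).digest():
        return "exact duplicate"
    distance = hamming(average_hash(a.pixels), average_hash(b.pixels))
    if distance <= near:        # hypothetical threshold
        return "near-duplicate"
    if distance <= review:      # hypothetical threshold
        return "needs human review"
    return "distinct"


def scan_from_pixels(pixels: Pixels) -> Scan:
    """Demo helper standing in for reading and decoding a real image file."""
    return Scan(bytes(v for row in pixels for v in row), pixels)


# The same left-to-right gradient "scanned" at 16x16 and at 32x32.
small = scan_from_pixels([[x * 16 for x in range(16)] for _ in range(16)])
large = scan_from_pixels([[(x // 2) * 16 for x in range(32)] for _ in range(32)])
result = triage(small, large)
```

Here the two gradients have different bytes, so the exact-hash pass misses them, but their average hashes agree and the pair is classed as a near-duplicate. Where the thresholds sit is a curatorial decision: a wider review band sends more pairs to human reviewers, a narrower one trusts the algorithm more.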