Boston's sprawling effort to digitize municipal records has surfaced a problem nobody budgeted for: tens of thousands of duplicate images clogging city and institutional archives, slowing search tools, inflating storage costs, and — in some cases — burying the accurate version of a document beneath several near-identical copies. The question of what to do next is now landing on desks across City Hall, the Boston Public Library's Digital Repository, and the city's network of university partners.
The timing matters because the city's digital infrastructure is mid-cycle. Mayor Michelle Wu's administration has pushed hard on open-government data tools as part of a broader transparency agenda, and the Boston Public Library's Copley Square branch recently completed the latest phase of its digitization contract, covering historical property records for neighborhoods including Jamaica Plain and Dorchester — two areas where housing production disputes have made accurate historical title documentation unusually consequential. When duplicate images contaminate those archives, the downstream effects reach planning hearings and community land trust negotiations, not just librarians.
What Duplication Actually Costs
The costs are concrete. Cloud storage contracts for municipal archives are typically priced per terabyte, and industry benchmarks suggest duplicate files can account for 20 to 40 percent of total stored data in large institutional digitization projects, a range that translates into real budget exposure when contracts run into six figures annually. The City of Boston's Office of Digital Innovation, headquartered at City Hall on Cambridge Street, has not published a figure for its current storage overhead, but deduplication methodology has appeared on the agenda of at least two recent vendor review sessions, according to notices posted to the city's procurement portal this spring.
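To put that range in concrete terms, here is a back-of-the-envelope sketch in Python. The $100,000 contract value is a hypothetical round number, not a City of Boston figure, and the calculation assumes cost scales linearly with stored data, which contracts with tiered pricing may not do.

```python
def duplicate_cost_range(annual_cost_usd: int,
                         low_pct: int = 20,
                         high_pct: int = 40) -> tuple[float, float]:
    """Estimate yearly spend attributable to duplicate files.

    Assumes cost is proportional to data stored (an assumption; tiered
    pricing would change the result). The default percentages are the
    20 to 40 percent range cited in the article.
    """
    return (annual_cost_usd * low_pct / 100,
            annual_cost_usd * high_pct / 100)


# Hypothetical six-figure contract, for illustration only.
low, high = duplicate_cost_range(100_000)
print(f"${low:,.0f} to ${high:,.0f} per year")  # $20,000 to $40,000 per year
```

On those assumptions, a contract at the bottom of the six-figure range would carry roughly $20,000 to $40,000 a year in storage spent on redundant copies.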
The Boston Public Library's digitization work runs through a partnership with the Internet Archive, the San Francisco-based nonprofit that hosts the Open Library project. That relationship means Boston's deduplication choices don't stay local: they feed into a shared infrastructure used by hundreds of institutions. Get the metadata wrong during a bulk deletion, and a unique scan disappears from a catalog that researchers in Roxbury and Rotterdam alike rely on.
At Northeastern University's Digital Scholarship Group, based on Huntington Avenue, staff have been developing hashing-based tools, software that computes a compact fingerprint for each image file, to flag probable duplicates before any human review. The approach is faster than manual comparison but still requires a librarian or archivist to make the final call on deletion, because two files can register as a match to an algorithm while representing different photographic exposures of the same document, each carrying distinct legibility value.
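For readers curious what hash-based flagging looks like in practice, the following is a minimal Python sketch, not the Digital Scholarship Group's actual tool, whose design has not been published here. It assumes a cryptographic hash (SHA-256), which catches only byte-for-byte identical files; flagging near-identical scans would require a perceptual hash instead. The function names are hypothetical. Note that the code only reports groups and deletes nothing, leaving the final call to a human reviewer.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    hasher = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def flag_probable_duplicates(root: Path) -> dict[str, list[Path]]:
    """Group files under `root` by digest; keep groups of two or more.

    Nothing is deleted here: the result is a review list for an archivist.
    """
    groups: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            groups[file_digest(path)].append(path)
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}
```

Calling `flag_probable_duplicates(Path("scans"))` on a hypothetical `scans` directory would return a mapping from each shared digest to the files that carry it. Two exposures of the same page would hash differently under this scheme and so would not be grouped at all, which is one reason the choice of hashing method matters as much as the review step.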
The Decisions That Can't Wait
Three choices will define how this plays out over the next 12 months. First, the city must decide whether deduplication will be centralized (run through a single vendor or city office) or distributed, with each agency managing its own records. The centralized model is cheaper but concentrates risk in a single point of failure. Distributed control fits Boston's departmental culture better but almost guarantees inconsistent standards.
Second, there is the question of public notice. When a duplicate image is deleted from a public archive, does the deletion get logged in a way that a citizen can audit? Civil liberties advocates, including staff at the ACLU of Massachusetts on Winter Street, have argued in other contexts that deletion logs are essential accountability infrastructure. Whether that principle extends to image archives hasn't been formally tested here.
Third — and most immediately — the BPL and the Office of Digital Innovation need to agree on a shared metadata standard before the next digitization phase begins in the fall. Without that agreement, the current duplication problem will reproduce itself in the next tranche of records, which is slated to include Dorchester neighborhood planning documents dating to the 1970s urban renewal era.
The Fourth of July holiday weekend means key decision-makers are out of reach until at least Monday, July 6. The next scheduled meeting of the city's Digital Services Advisory Group is set for July 15 at City Hall. Whatever framework emerges from that session will either put the archive cleanup on a defined track or kick the hard choices into budget season — when the political calendar tends to swallow technical problems whole.