Boston's major cultural and academic institutions are sitting on a problem years in the making. Duplicate image files — thousands of near-identical scans, photographs, and digital records accumulated across decades of digitization drives — are consuming server capacity, inflating storage costs, and making archival search tools slower and less reliable. Now, with several institutions facing contract renewals on cloud storage agreements this fall, the pressure to act is landing squarely on IT directors and archivists across the city.
The issue matters now because the stakes have changed. Through the early 2020s, storage was cheap enough that letting duplicates accumulate felt like a reasonable trade-off against the risk of accidental deletion. That calculus has shifted. Cloud storage pricing has climbed, digitization volumes have exploded across Boston's hospital networks, university libraries, and city government offices, and new federal records-retention guidance issued in early 2026 is forcing institutions to audit what they actually hold before they can certify compliance.
Where the Problem Is Most Acute
At the Boston Public Library on Copley Square, staff have been working since January to audit the digital holdings in the Leventhal Map and Education Center, where duplicate scans of nineteenth-century maps have multiplied across at least three separate digitization projects run by different vendor teams. The library has not publicly disclosed the full scope of the redundancy, but archivists familiar with large-scale digitization efforts say repositories of that age and size commonly carry duplication rates of 15 to 30 percent across their image libraries.
Northeastern University's Snell Library on Huntington Avenue faces a similar reckoning in its Archives and Special Collections division. The university completed a major digital infrastructure migration in 2024, and technical staff are now working through the aftermath, identifying which image files were copied multiple times during the transition and which represent genuinely distinct records. The work is painstaking. Automated deduplication tools can catch byte-identical files quickly, but near-duplicates (a slightly different crop, a rescan at a higher resolution, an image re-encoded and saved under a different filename) still require human review.
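The automated half of that split can be sketched in a few lines. The example below is a minimal illustration, not a description of any tool Northeastern or the other institutions actually use: it groups files by a SHA-256 checksum of their bytes, so only exact copies are matched. The function names and the directory layout are assumptions made for the sketch.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file's bytes in chunks so large scans need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(root: Path) -> list[list[Path]]:
    """Return groups of files under root whose contents are byte-identical."""
    by_hash: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            by_hash[sha256_of(path)].append(path)
    # Only hashes shared by two or more files indicate duplication.
    return [group for group in by_hash.values() if len(group) > 1]
```

A crop, a rescan, or a re-encoded copy changes the underlying bytes and therefore the checksum, which is why files of that kind fall outside a check like this one and end up in front of a reviewer.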
City Hall's own records management office has flagged duplicate imagery in the Boston Planning and Development Agency's project photo archive, where construction documentation from Jamaica Plain and Dorchester housing developments has been uploaded repeatedly by different contractors using different naming conventions. That archive feeds directly into public-facing transparency portals.
The Decisions That Will Define the Outcome
Three choices will determine how well Boston's institutions handle this over the next twelve months. First, whether to run automated deduplication tools before or after a human-led audit. Running automation first is faster and cheaper, but risks permanently deleting files that look identical but carry distinct metadata, a mistake that cannot be undone in a closed archive. Several peer institutions in New York and Chicago have already learned this lesson the hard way.
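One way to hedge against that risk is to have the automated pass sort duplicates into piles without deleting anything. The sketch below assumes a hypothetical `ImageRecord` with a precomputed content checksum and a flat metadata dictionary of string values; neither the record shape nor the `triage` function comes from any institution named here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ImageRecord:
    path: str
    content_hash: str          # hypothetical: checksum of the image bytes
    metadata: dict[str, str]   # hypothetical: catalogue fields for this copy


def triage(records: list[ImageRecord]):
    """Split duplicate groups into auto-eligible and needs-review piles.

    Nothing is deleted here; the function only reports what it found.
    """
    groups: dict[str, list[ImageRecord]] = defaultdict(list)
    for record in records:
        groups[record.content_hash].append(record)

    auto_eligible, needs_review = [], []
    for group in groups.values():
        if len(group) < 2:
            continue  # a unique file is not a duplicate
        distinct_metadata = {frozenset(r.metadata.items()) for r in group}
        if len(distinct_metadata) == 1:
            auto_eligible.append(group)   # same pixels, same metadata
        else:
            needs_review.append(group)    # same pixels, different metadata
    return auto_eligible, needs_review
```

Groups in the second pile are the ones the paragraph above warns about: the images match, but each copy may carry catalogue information that exists nowhere else.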
Second, institutions must decide who owns the deduplication decision. At the Boston Public Library and Northeastern, the tension between IT departments focused on cost efficiency and archivists focused on preservation integrity is real. Neither group has unilateral authority, and without a clear governance structure, projects stall. The Massachusetts Board of Library Commissioners, which oversees standards for public library collections statewide, has indicated it is developing guidance on digital asset management, though no formal policy has been published as of July 4, 2026.
Third, there is the question of what to do with verified duplicates once identified. Deletion is not the only option. Some institutions are opting to migrate lower-resolution duplicates to cold storage rather than destroy them — preserving a fallback while freeing primary server space. Cold storage on major cloud platforms runs significantly cheaper than active storage tiers, making this a defensible middle path for budget-constrained operations.
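The cold-storage approach comes down to a simple rule per group of verified duplicates: keep the highest-resolution copy on primary storage and move the rest. The sketch below illustrates that rule only. The `Scan` record and `plan_tiering` function are hypothetical, pixel count is used as a stand-in for quality, and the actual move to a cold tier is left out because it depends on the storage provider.

```python
from typing import NamedTuple


class Scan(NamedTuple):
    path: str
    width: int   # pixels
    height: int  # pixels


def plan_tiering(group: list[Scan]) -> tuple[Scan, list[Scan]]:
    """Pick one copy to keep on primary storage; the rest go to cold storage.

    Assumes the group is non-empty and already verified as duplicates.
    On a tie in pixel count, the earliest copy in the list is kept.
    """
    keep = max(group, key=lambda scan: scan.width * scan.height)
    to_cold = [scan for scan in group if scan is not keep]
    return keep, to_cold
```

Because the function returns a plan rather than moving files, an archivist can review the list before anything leaves primary storage.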
The MBTA's public communications archive and the City of Boston's 311 photo documentation system are also understood to be under internal review, though neither office has announced a formal deduplication program. Institutions that act before their next cloud contract renewal — most of which cluster around October and November 2026 — will be better positioned to renegotiate pricing based on accurate storage footprints. Those that wait will be negotiating blind.