Tens of thousands of duplicate image files are clogging the digital archives maintained by Boston-area public institutions, universities, and city agencies — and the scale of the problem, measured in petabytes of wasted storage and hundreds of thousands of misfiled records, is only now coming into focus as archivists begin systematic audits.
The timing matters. Boston's public-sector digitization push accelerated sharply after the COVID-19 pandemic forced city hall and affiliated agencies to move records online. The Mayor's Office of Digital Innovation, which sits under City Hall on City Hall Plaza, oversaw a significant expansion of cloud-based document storage between 2020 and 2024. The unintended consequence: scan-and-upload workflows that prioritized speed over deduplication left institutional repositories bloated with redundant files that now complicate public records requests, slow search tools, and inflate licensing costs for cloud storage.
What the Data Actually Shows
The duplicate image problem is not unique to Boston, but local repositories illustrate the numbers with unusual clarity. The Boston Public Library's Digital Commonwealth platform, a statewide collaborative that BPL anchors from its Copley Square headquarters, hosts more than 1.4 million digitized objects as of mid-2026. Archivists working on a 2025 internal quality review found that roughly 8 to 12 percent of image assets in some collection batches were exact or near-exact duplicates, which means the usable collection is smaller than the headline figure suggests. Digital Commonwealth has not published a formal duplicate rate for its full holdings; the figures cited here reflect ranges described in publicly available conference presentations by digital preservation specialists at regional library consortia.
At Northeastern University's Snell Library on Huntington Avenue, a similar audit of its Archives and Special Collections digitization pipeline found that batch scanning of physical photograph collections — particularly those donated without original cataloging — produced duplicate-image rates that required manual review of thousands of files before metadata could be reliably assigned. Northeastern has not released a specific duplicate count publicly, but the library has presented on the challenge at Digital Library Federation forums.
Storage costs compound the problem. Amazon Web Services S3, which multiple Boston-area institutions use for archival tiers, costs roughly $0.023 per gigabyte per month for standard storage as of 2026. A repository carrying even 500 gigabytes of genuinely redundant image data (a conservative estimate for a mid-size institutional archive) pays roughly $138 a year for files that serve no retrieval purpose. Across dozens of city and university systems, those figures aggregate quickly.
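The arithmetic behind that estimate is easy to reproduce. The short sketch below assumes the flat $0.023 per gigabyte-month rate quoted above and ignores request fees, volume discounts, and retrieval charges; the function name is illustrative only.

```python
def annual_storage_cost(redundant_gb: float, usd_per_gb_month: float = 0.023) -> float:
    """Yearly cost in dollars of keeping `redundant_gb` in storage at a flat monthly rate."""
    return redundant_gb * usd_per_gb_month * 12


# 500 GB of redundant images at the quoted rate
print(f"${annual_storage_cost(500):.2f} per year")
```

Changing the first argument shows how the bill scales for larger repositories under the same flat-rate assumption.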
What Deduplication Actually Requires
Fixing the problem is not simply a matter of running a script. Perceptual hashing tools — software that generates a fingerprint for each image and flags near-matches — can catch obvious duplicates, but archivists at the Boston City Archives, located in West Roxbury, have flagged a harder challenge: images that are technically distinct files but represent the same physical artifact photographed twice under different lighting conditions. Both files may carry independent catalog records, meaning automated deletion risks destroying metadata that took hours to create.
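To make the idea of an image fingerprint concrete, here is a minimal sketch of one simple perceptual hash, the "average hash." It is an assumption chosen for illustration, not a description of the tools any institution named here actually runs. It works on a plain grid of grayscale values rather than real image files, and it only flags candidates for human review, since automated deletion carries the metadata risk described above.

```python
from typing import List

Gray = List[List[int]]  # rows of 0-255 grayscale values


def average_hash(pixels: Gray, size: int = 8) -> int:
    """Shrink the image to size x size by block averaging, then set one bit
    per cell that is brighter than the mean of all cells.
    Assumes the image is at least `size` pixels in each dimension."""
    h, w = len(pixels), len(pixels[0])
    cells = []
    for r in range(size):
        for c in range(size):
            r0, r1 = r * h // size, (r + 1) * h // size
            c0, c1 = c * w // size, (c + 1) * w // size
            block = [pixels[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | int(value > mean)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


# Synthetic 16x16 test images (illustrative data, not real scans)
original = [[x * 16 for x in range(16)] for _ in range(16)]   # left-to-right gradient
brighter = [[min(255, v + 10) for v in row] for row in original]  # same image, brighter
different = [[y * 16 for _ in range(16)] for y in range(16)]  # top-to-bottom gradient

print(hamming_distance(average_hash(original), average_hash(brighter)))   # 0
print(hamming_distance(average_hash(original), average_hash(different)))  # 32
```

A small distance suggests the two files show the same content, while a large one suggests different content. Where to set the cutoff is a judgment call that would need tuning against real collections, and a uniform brightness shift like the one above is a much easier case than two separate photographs of the same artifact.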
The Massachusetts Board of Library Commissioners has included deduplication guidance in its digital preservation planning resources, and several regional library networks have begun coordinating on shared tooling. The IMLS-funded ReDiscovery Project, a national initiative with participating institutions in New England, is piloting workflow standards that embed deduplication checks at the point of ingest rather than as a retroactive cleanup task — a change that archivists say is the only sustainable fix.
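What a check "at the point of ingest" might look like can be sketched in a few lines. The class and method names below are hypothetical and are not drawn from the ReDiscovery Project or any state guidance; the sketch assumes the repository keeps an index of SHA-256 digests for files it has already accepted, and it catches only byte-identical copies.

```python
import hashlib
from typing import Dict, Optional


class IngestIndex:
    """Hypothetical in-memory index of digests for files already accepted."""

    def __init__(self) -> None:
        self._seen: Dict[str, str] = {}  # SHA-256 hex digest -> record ID

    def check_and_register(self, record_id: str, data: bytes) -> Optional[str]:
        """Return the record ID of an existing identical file, or None if the
        file is new (in which case it is added to the index)."""
        digest = hashlib.sha256(data).hexdigest()
        existing = self._seen.get(digest)
        if existing is not None:
            return existing
        self._seen[digest] = record_id
        return None


index = IngestIndex()
print(index.check_and_register("rec-001", b"scan bytes"))   # None: new file
print(index.check_and_register("rec-002", b"scan bytes"))   # rec-001: exact duplicate
```

Returning the earlier record ID instead of rejecting the upload outright leaves the decision with the archivist, who can merge catalog records before anything is discarded.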
For Boston institutions moving forward, the practical priority is clear: build deduplication into digitization contracts before scanning begins. City agencies preparing bids for new scanning projects, including ongoing work tied to the Wu administration's open-data commitments, should require vendors to deliver MD5 or SHA-256 checksums alongside every image file. Checksums expose only byte-identical copies, not near-duplicates, but they give agencies a verifiable record against which exact-duplicate detection can be confirmed. Without that requirement written into procurement language, the next round of audits will find the same problem waiting.
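An agency receiving such a delivery could verify it with a short script along these lines. This is a sketch under stated assumptions: the manifest is taken to be a simple mapping of file name to SHA-256 hex digest, which is a hypothetical format that a real contract would need to specify, and all function names are illustrative. SHA-256 is used here because MD5, while adequate for spotting accidental duplicates, is vulnerable to deliberately engineered collisions.

```python
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large scans need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(root: Path, manifest: Dict[str, str]) -> List[str]:
    """Return names of files that are missing or whose digest does not match."""
    bad = []
    for name, expected in manifest.items():
        path = root / name
        if not path.is_file() or sha256_of(path) != expected.lower():
            bad.append(name)
    return bad


def exact_duplicates(manifest: Dict[str, str]) -> List[List[str]]:
    """Group file names that share a digest, i.e. byte-identical files."""
    groups = defaultdict(list)
    for name, digest in manifest.items():
        groups[digest.lower()].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]
```

Running `verify_manifest` against the delivered folder confirms that the checksums describe the files actually received, and `exact_duplicates` lists any groups of identical files the vendor should have caught before delivery.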