Boston's public institutions are sitting on a digital storage problem years in the making. Duplicate image files — the same photograph indexed two, three, sometimes a dozen times across mismatched database systems — have quietly ballooned into a measurable drain on city and university IT budgets, according to internal audits reviewed by The Daily Boston and conversations with archivists working inside several affected organizations.
The problem matters now because city agencies, from the Boston Public Library's Digital Repository on Boylston Street to the Mayor's Office of Arts and Culture, have been accelerating digitization drives since 2022. More scanning means more files. More files, without a coordinated deduplication protocol, means redundancy that compounds with every cycle. One mid-sized municipal archive can generate north of 40,000 image assets per digitization cycle, and without automated duplicate detection, staff are left to manually reconcile records that should have been caught at the ingest stage.
The Scale of the Problem in Boston's Institutions
The Boston Public Library's Digital Commonwealth platform, a statewide initiative administered through a partnership with the Massachusetts Board of Library Commissioners, hosts more than 1.7 million digital objects as of its most recent public count. Staff there have acknowledged that duplicate ingestion — particularly from partner libraries uploading collections independently — has been a recurring data quality challenge. Northeastern University's Digital Repository Services, based on the Huntington Avenue campus in Fenway, faces a parallel issue: collections donated by multiple sources often arrive with overlapping photographs already processed by the donor institution, creating redundant master files that eat into allocated storage quotas.
Storage is not free. Cloud archiving costs for high-resolution TIFF image files, the archival standard, typically run between $0.02 and $0.05 per gigabyte per month depending on the vendor tier. A single undeduplicated collection of 50,000 images, each averaging 30 megabytes, consumes roughly 1.5 terabytes, or 1,500 gigabytes. Run the numbers: that's between $30 and $75 a month in pure storage overhead for one redundant set. Multiplied across dozens of institutional collections, and compounded over years, the figure becomes significant against flat or shrinking digital infrastructure budgets.
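For readers who want to check the arithmetic, here is a minimal sketch of the calculation using the figures quoted above. The function name and the use of decimal units (1 GB = 1,000 MB) are assumptions made for illustration; actual vendor billing may differ.

```python
def monthly_storage_cost(num_images, avg_mb_per_image, usd_per_gb_month):
    """Return (terabytes stored, USD per month) for one collection.

    Assumes decimal units: 1 GB = 1,000 MB and 1 TB = 1,000 GB.
    """
    total_gb = num_images * avg_mb_per_image / 1000
    return total_gb / 1000, total_gb * usd_per_gb_month


# Figures from the article: 50,000 images at 30 MB each,
# priced at $0.02 to $0.05 per gigabyte per month.
size_tb, low_cost = monthly_storage_cost(50_000, 30, 0.02)
_, high_cost = monthly_storage_cost(50_000, 30, 0.05)
print(f"{size_tb:.1f} TB costs ${low_cost:.2f} to ${high_cost:.2f} per month")
```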
The MBTA's internal communications archive, maintained separately from the public-facing system and used for engineering and planning records, reportedly underwent an image audit in late 2024 as part of a broader records modernization effort tied to the agency's ongoing reliability reform push. The audit scope included photographic documentation of track infrastructure and station conditions at stops including Back Bay, JFK/UMass, and Forest Hills. The precise volume of duplicates identified has not been made public, but the audit itself signals that even transit agencies are now treating redundant image data as an administrative liability rather than a benign byproduct of documentation work.
What Deduplication Actually Costs — and What It Saves
Fixing the problem is not simple. Automated deduplication software — tools that hash image files and flag exact or near-exact matches — can process large collections quickly, but institutions must first decide which version of a duplicate is the canonical record. That decision requires human review. Archivists at institutions like the Boston Athenaeum on Beacon Street or the Massachusetts Historical Society on Boylston Street have long understood that two photographs that appear identical to a computer algorithm may have different provenance metadata that makes each archivally distinct.
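What "hashing image files" looks like in practice can be sketched in a few lines. The example below is illustrative only and does not describe any named institution's software: it groups files by SHA-256 digest, which catches byte-for-byte copies but not near-exact matches such as re-scans or re-encoded versions. The function names are hypothetical, and the output is a list of candidate groups for an archivist to review, not an automatic deletion list.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MB chunks so large TIFFs are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(root):
    """Return lists of paths under `root` whose contents are byte-identical."""
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            groups[sha256_of(path)].append(path)
    # Only groups with more than one file are duplicate candidates.
    return [paths for paths in groups.values() if len(paths) > 1]
```

Which file in each group becomes the canonical record remains a human decision, since provenance metadata is typically held in the catalog rather than in the image bytes being hashed.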
The practical math still favors intervention. A deduplication pass on a 500,000-asset collection that removes even 8 percent of files, a conservative estimate for institutions that have never run such a process, eliminates 40,000 objects. At 30 megabytes each, that's 1.2 terabytes of recovered storage and a cleaner catalog that researchers, journalists, and the public can actually navigate without wading through identical results.
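The same estimate can be reproduced as a short calculation. This is a sketch under stated assumptions: the function name is hypothetical, units are decimal, and the monthly savings figure simply applies the $0.02 to $0.05 per gigabyte price range quoted earlier in this article to the recovered storage.

```python
def dedup_savings(total_assets, duplicate_rate, avg_mb_per_image):
    """Return (objects removed, terabytes recovered), assuming 1 TB = 1,000,000 MB."""
    removed = round(total_assets * duplicate_rate)
    recovered_tb = removed * avg_mb_per_image / 1_000_000
    return removed, recovered_tb


# Figures from the article: 500,000 assets, 8 percent duplicates, 30 MB each.
removed, recovered_tb = dedup_savings(500_000, 0.08, 30)
recovered_gb = recovered_tb * 1000
print(f"{removed:,} objects removed, {recovered_tb:.1f} TB recovered")
print(f"${recovered_gb * 0.02:.2f} to ${recovered_gb * 0.05:.2f} saved per month")
```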
For institutions planning their next digitization cycle, archivists recommend building deduplication checks directly into the ingest workflow rather than retrofitting them afterward. The Digital Commonwealth platform has published metadata quality guidelines that address this, and the Massachusetts Board of Library Commissioners runs periodic training sessions for partner institutions. The next scheduled training series is expected in fall 2026. Waiting until a collection reaches crisis scale makes the remediation job considerably harder, because every duplicate that slips through must later be found and reconciled by hand. The storage bills in the meantime don't pause.
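A check at the ingest stage could be as simple as the sketch below. It is a hypothetical illustration, not a description of the Digital Commonwealth guidelines or any institution's actual workflow: the class name, method name, and record IDs are invented, and a real system would keep the digest index in a database rather than in memory.

```python
import hashlib


class IngestIndex:
    """Tracks content digests of accepted files so repeats are flagged on arrival."""

    def __init__(self):
        self._seen = {}  # SHA-256 digest -> record ID of the first accepted copy

    def check(self, record_id, data):
        """Return the existing record ID if `data` is a byte-identical repeat, else None."""
        digest = hashlib.sha256(data).hexdigest()
        existing = self._seen.get(digest)
        if existing is not None:
            return existing  # flag for archivist review instead of storing again
        self._seen[digest] = record_id
        return None


index = IngestIndex()
first = index.check("partner-a/0001", b"scan bytes")
repeat = index.check("partner-b/0417", b"scan bytes")
print(first, repeat)
```

Flagging rather than silently rejecting the second upload leaves room for the human review that archivists say the canonical-record decision requires.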