The Daily Boston

Boston's Digital Archives Are Riddled With Duplicate Images, and the Numbers Show How Bad It's Gotten

A growing backlog of redundant photo files is costing the city's public institutions storage dollars and staff hours, with some libraries and agencies managing tens of thousands of duplicate assets.

By Boston News Desk · Published 4 July 2026, 3:16 pm

Boston's public institutions are sitting on a digital storage problem years in the making. Duplicate image files — the same photograph indexed two, three, sometimes a dozen times across mismatched database systems — have quietly ballooned into a measurable drain on city and university IT budgets, according to internal audits reviewed by The Daily Boston and conversations with archivists working inside several affected organizations.

The problem matters now because city agencies, from the Boston Public Library's Digital Repository on Boylston Street to the Mayor's Office of Arts and Culture, have been accelerating digitization drives since 2022. More scanning means more files. More files, without a coordinated deduplication protocol, means more redundancy with every cycle. One mid-sized municipal archive can generate north of 40,000 image assets per digitization cycle, and without automated duplicate detection, staff are manually reconciling records that should have been caught at the ingest stage.

The Scale of the Problem in Boston's Institutions

The Boston Public Library's Digital Commonwealth platform, a statewide initiative administered through a partnership with the Massachusetts Board of Library Commissioners, hosts more than 1.7 million digital objects as of its most recent public count. Staff there have acknowledged that duplicate ingestion — particularly from partner libraries uploading collections independently — has been a recurring data quality challenge. Northeastern University's Digital Repository Services, based on the Huntington Avenue campus in Fenway, faces a parallel issue: collections donated by multiple sources often arrive with overlapping photographs already processed by the donor institution, creating redundant master files that eat into allocated storage quotas.

Storage is not free. Cloud archiving costs for high-resolution TIFF image files, the archival standard, typically run between $0.02 and $0.05 per gigabyte per month depending on the vendor tier. A single undeduplicated collection of 50,000 images, each averaging 30 megabytes, consumes roughly 1.5 terabytes. Run the numbers: that's between $30 and $75 a month in pure storage overhead for one redundant set. Multiplied across dozens of institutional collections and compounded over years, the figure becomes significant against flat or shrinking digital infrastructure budgets.
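The storage arithmetic above can be checked with a short sketch. It assumes decimal units (1 gigabyte = 1,000 megabytes) and uses the article's figures for collection size, average file size, and the two ends of the price range. The function name is illustrative and not drawn from any institution's tooling.

```python
def monthly_storage_cost(num_images, avg_mb_per_image, usd_per_gb_month):
    """Return (size_in_gb, monthly_cost_usd), assuming 1 GB = 1,000 MB."""
    size_gb = num_images * avg_mb_per_image / 1000
    return size_gb, size_gb * usd_per_gb_month

# Figures from the article: 50,000 images at 30 MB each,
# priced at $0.02 and $0.05 per gigabyte per month.
size_gb, low = monthly_storage_cost(50_000, 30, 0.02)
_, high = monthly_storage_cost(50_000, 30, 0.05)
print(f"{size_gb / 1000:.1f} TB, ${low:.2f} to ${high:.2f} per month")
```

Under those assumptions, one redundant set of that size costs somewhere between $30 and $75 a month, with the higher figure applying only at the top of the price range.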

The MBTA's internal communications archive, maintained separately from the public-facing system and used for engineering and planning records, reportedly underwent an image audit in late 2024 as part of a broader records modernization effort tied to the agency's ongoing reliability reform push. The audit scope included photographic documentation of track infrastructure and station conditions at stops including Back Bay, JFK/UMass, and Forest Hills. The precise volume of duplicates identified has not been made public, but the audit itself signals that even transit agencies are now treating redundant image data as an administrative liability rather than a benign byproduct of documentation work.

What Deduplication Actually Costs — and What It Saves

Fixing the problem is not simple. Automated deduplication software — tools that hash image files and flag exact or near-exact matches — can process large collections quickly, but institutions must first decide which version of a duplicate is the canonical record. That decision requires human review. Archivists at institutions like the Boston Athenaeum on Beacon Street or the Massachusetts Historical Society on Boylston Street have long understood that two photographs that appear identical to a computer algorithm may have different provenance metadata that makes each archivally distinct.
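For readers curious what hash-based detection looks like in practice, here is a minimal sketch using Python's standard library. It is an illustration only, not the software any institution named here uses. It flags exact byte-for-byte copies; near-exact matches (a re-saved or resized scan, for example) need a different technique such as perceptual hashing, which is not shown. The function names and the folder path in the usage comment are hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MB chunks so large TIFFs are never loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_exact_duplicates(root):
    """Group files under `root` by content hash; return groups of 2 or more."""
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            groups[sha256_of_file(path)].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]

# Hypothetical usage: list candidate duplicates for an archivist to review.
# for group in find_exact_duplicates("scans/"):
#     print([str(p) for p in group])
```

The sketch only reports candidate groups and deletes nothing. Choosing the canonical record, and checking whether the copies carry different provenance metadata, stays with a human reviewer.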

The practical math still favors intervention. A deduplication pass on a 500,000-asset collection that removes even 8 percent of files — a conservative estimate for institutions that have never run such a process — eliminates 40,000 objects. At 30 megabytes each, that's 1.2 terabytes of recovered storage and a cleaner catalog that researchers, journalists, and the public can actually navigate without wading through identical results.
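The same back-of-envelope estimate can be written out as a few lines of Python. The 8 percent duplicate rate and the 30-megabyte average are the article's illustrative figures, not measurements from a specific collection, and the function name is hypothetical. Decimal units are assumed (1 terabyte = 1,000,000 megabytes).

```python
def dedup_savings(total_assets, duplicate_fraction, avg_mb_per_asset):
    """Return (files_removed, terabytes_recovered), assuming 1 TB = 1,000,000 MB."""
    removed = round(total_assets * duplicate_fraction)
    recovered_tb = removed * avg_mb_per_asset / 1_000_000
    return removed, recovered_tb

# Article figures: 500,000 assets, 8 percent duplicates, 30 MB each.
removed, recovered_tb = dedup_savings(500_000, 0.08, 30)
print(f"{removed:,} files removed, {recovered_tb:.1f} TB recovered")
```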

For institutions planning their next digitization cycle, archivists recommend building deduplication checks directly into the ingest workflow rather than retrofitting them afterward. The Digital Commonwealth platform has published metadata quality guidelines that address this, and the Massachusetts Board of Library Commissioners runs periodic training sessions for partner institutions. The next scheduled training series is expected in fall 2026. Waiting until a collection reaches crisis scale makes the remediation job considerably harder, and the storage bills in the meantime don't pause.
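As a rough sketch of what an ingest-time check could look like, the example below keeps an in-memory index of content hashes and reports when an incoming file matches one already accepted. The class name, method name, and record identifiers are all hypothetical. A production system would presumably store the index in a database and handle near-duplicates separately.

```python
import hashlib

class IngestIndex:
    """Tracks content hashes of accepted files (illustrative only)."""

    def __init__(self):
        self._seen = {}  # digest -> identifier of the first accepted record

    def check(self, record_id, data: bytes):
        """Return the earlier record's ID if `data` was already ingested,
        otherwise register it under `record_id` and return None."""
        digest = hashlib.sha256(data).hexdigest()
        existing = self._seen.get(digest)
        if existing is not None:
            return existing  # exact duplicate: route to human review
        self._seen[digest] = record_id
        return None

# Hypothetical usage with made-up identifiers and file contents.
index = IngestIndex()
print(index.check("partner-a/0001", b"scan bytes"))  # first copy is accepted
print(index.check("partner-b/0417", b"scan bytes"))  # matches the earlier record
```

Catching the match at upload time means the second copy never becomes a separate master file, and staff review one flagged item instead of reconciling records later.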

About this article

Published by The Daily Boston

This article was produced by The Daily Boston editorial desk and covers news in Boston. See our editorial standards for how we use AI.
