The Daily Boston

Boston news, every day


Boston's Digital Archives Are Riddled With Duplicate Images — and the Numbers Are Staggering

A growing problem in the city's public records and institutional databases is costing storage budgets and burying searchable history under layers of redundant files.

By Boston News Desk · Published 4 July 2026, 3:06 pm

Photo: David Adam Kess / CC BY-SA 4.0 (Wikimedia Commons)

Tens of thousands of duplicate image files are clogging the digital archives maintained by Boston-area public institutions, universities, and city agencies — and the scale of the problem, measured in petabytes of wasted storage and hundreds of thousands of misfiled records, is only now coming into focus as archivists begin systematic audits.

The timing matters. Boston's public-sector digitization push accelerated sharply after the COVID-19 pandemic forced city hall and affiliated agencies to move records online. The Mayor's Office of Digital Innovation, which sits under City Hall on City Hall Plaza, oversaw a significant expansion of cloud-based document storage between 2020 and 2024. The unintended consequence: scan-and-upload workflows that prioritized speed over deduplication left institutional repositories bloated with redundant files that now complicate public records requests, slow search tools, and inflate licensing costs for cloud storage.

What the Data Actually Shows

The duplicate image problem is not unique to Boston, but local repositories illustrate the numbers with unusual clarity. The Boston Public Library's Digital Commonwealth platform, a statewide collaborative that BPL anchors from its Copley Square headquarters, hosts more than 1.4 million digitized objects as of mid-2026. Archivists working on a 2025 internal quality review found that roughly 8 to 12 percent of image assets in some collection batches were exact or near-exact duplicates, which means the number of unique items is smaller than the headline figure suggests. Digital Commonwealth has not published a formal duplicate rate across its full holdings; the figures cited here reflect ranges described in publicly available conference presentations by digital preservation specialists at regional library consortia.

At Northeastern University's Snell Library on Huntington Avenue, a similar audit of its Archives and Special Collections digitization pipeline found that batch scanning of physical photograph collections — particularly those donated without original cataloging — produced duplicate-image rates that required manual review of thousands of files before metadata could be reliably assigned. Northeastern has not released a specific duplicate count publicly, but the library has presented on the challenge at Digital Library Federation forums.

Storage costs compound the problem. Amazon Web Services S3 storage, which multiple Boston-area institutions use for archival tiers, runs roughly $0.023 per gigabyte per month for standard storage as of 2026 pricing. A repository carrying even 500 gigabytes of genuinely redundant image data — a conservative estimate for a mid-size institutional archive — pays roughly $138 a year for files that serve no retrieval purpose. Across dozens of city and university systems, those figures aggregate quickly.
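The arithmetic behind that figure can be checked with a short sketch. The rate and the 500-gigabyte volume come from the paragraph above; the function name is illustrative, and the flat per-gigabyte rate is a simplifying assumption that ignores tiered pricing, request fees, and data transfer charges.

```python
# Rate quoted in the article for S3 standard storage (USD per GB per month).
S3_STANDARD_USD_PER_GB_MONTH = 0.023


def annual_redundant_storage_cost(redundant_gb, usd_per_gb_month=S3_STANDARD_USD_PER_GB_MONTH):
    """Yearly cost of keeping `redundant_gb` gigabytes at a flat monthly rate."""
    return redundant_gb * usd_per_gb_month * 12


# 500 GB of redundant images, the article's example volume.
cost = annual_redundant_storage_cost(500)
print(f"${cost:.2f} per year")
```

At the quoted rate, 500 gigabytes comes to $11.50 a month, or $138 over twelve months, which matches the figure in the text.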

What Deduplication Actually Requires

Fixing the problem is not simply a matter of running a script. Perceptual hashing tools — software that generates a fingerprint for each image and flags near-matches — can catch obvious duplicates, but archivists at the Boston City Archives, located in West Roxbury, have flagged a harder challenge: images that are technically distinct files but represent the same physical artifact photographed twice under different lighting conditions. Both files may carry independent catalog records, meaning automated deletion risks destroying metadata that took hours to create.
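To make the fingerprinting idea concrete, here is a minimal sketch of one common perceptual hash, the difference hash. It is not the tooling any of the institutions named here are known to use. It assumes each image has already been decoded, converted to grayscale, and shrunk to a small fixed grid of brightness values (in practice an image library would handle that step); the function names and the distance threshold are illustrative.

```python
def dhash(gray):
    """Difference hash: one bit per horizontally adjacent pixel pair.

    `gray` is a list of rows of brightness values. A bit is 1 when the
    left pixel is brighter than its right-hand neighbor.
    """
    bits = 0
    for row in gray:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming(a, b):
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(gray_a, gray_b, max_distance=2):
    """Flag a pair for human review; the threshold is an assumed value."""
    return hamming(dhash(gray_a), dhash(gray_b)) <= max_distance


original = [
    [10, 20, 30, 40, 50],
    [50, 40, 30, 20, 10],
    [10, 50, 10, 50, 10],
    [90, 80, 70, 60, 50],
]
# The same picture, uniformly brighter: the gradients are unchanged.
brighter = [[value + 15 for value in row] for row in original]
# A different picture: every row mirrored left to right.
mirrored = [list(reversed(row)) for row in original]

print(is_near_duplicate(original, brighter))
print(is_near_duplicate(original, mirrored))
```

Because the hash records only whether brightness rises or falls between neighbors, a uniform exposure change leaves it untouched. Two separate photographs of the same artifact can differ far more than that, which is why a match is better treated as a prompt for review than as grounds for deletion, especially when both files carry their own catalog records.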

The Massachusetts Board of Library Commissioners has included deduplication guidance in its digital preservation planning resources, and several regional library networks have begun coordinating on shared tooling. The IMLS-funded ReDiscovery Project, a national initiative with participating institutions in New England, is piloting workflow standards that embed deduplication checks at the point of ingest rather than as a retroactive cleanup task — a change that archivists say is the only sustainable fix.

For Boston institutions moving forward, the practical priority is clear: build deduplication into digitization contracts before scanning begins. City agencies preparing bids for new scanning projects, including ongoing work tied to the Wu administration's open-data commitments, should require vendors to deliver MD5 or SHA-256 checksums alongside every image file. A checksum manifest lets the receiving institution verify file integrity and flag byte-identical duplicates at ingest, though it will not catch near-duplicates such as a second scan of the same item. Without that requirement written into procurement language, the next round of audits will find the same problem waiting.
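As a rough illustration of what a checksum requirement makes possible, the sketch below computes a SHA-256 digest for each delivered file and groups files whose digests match. It uses only Python's standard `hashlib` module; the function names are hypothetical and do not describe any vendor's or agency's actual workflow.

```python
import hashlib
from collections import defaultdict


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(paths):
    """Map each digest shared by two or more files to the list of those files."""
    by_digest = defaultdict(list)
    for path in paths:
        by_digest[sha256_of_file(path)].append(path)
    return {d: files for d, files in by_digest.items() if len(files) > 1}
```

A check like this only catches files that are identical byte for byte. A re-scan, a re-encoded copy, or a second photograph of the same item produces a different digest, so it complements perceptual matching and manual review and does not replace them.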


About this article

Published by The Daily Boston

This article was produced by The Daily Boston editorial desk and covers news in Boston. See our editorial standards for how we use AI.
