The Daily Boston


Boston's Digital Archives Have a Duplicate Image Problem — Here's What Officials and Experts Are Saying

From City Hall records to university libraries, the push to clean up redundant digital files is drawing attention from archivists, technologists, and municipal administrators across Greater Boston.

By Boston News Desk · Published 4 July 2026, 3:06 pm

Photo: Phil Evenden on Pexels

Boston's public institutions are sitting on thousands of duplicate digital images — redundant files clogging servers, inflating storage costs, and muddying public records — and the people responsible for managing those archives say the problem has reached a tipping point. From the Boston City Archives on School Street to the digital collections at the Boston Public Library's Central Branch on Boylston Street, administrators are grappling with how to identify, flag, and responsibly remove identical or near-identical image files that have accumulated over years of disorganized uploads and migrations.

The issue matters now because several of Boston's largest institutions are in the middle of, or approaching, major digital infrastructure transitions. The MBTA, which has been under sustained pressure to modernize its operations after years of reliability failures, began digitizing maintenance and incident documentation in earnest in 2023. The Boston Public Schools system has been building out a centralized digital asset system as part of a broader records modernization effort. Both processes have produced exactly the kind of unstructured bulk uploads where duplicate imagery proliferates fastest.

What the Archivists and Technologists Are Saying

Professionals working in digital preservation have been vocal about the operational consequences. Duplicate images don't just waste storage capacity — they create compliance headaches under Massachusetts public records law, slow down search retrieval times, and introduce confusion when staff pull files for official use. Archivists at the Northeastern University Libraries on Huntington Avenue, which manages one of the region's more sophisticated digital collections, have been experimenting with perceptual hashing tools that can detect near-duplicate images even when file names or metadata differ. The technology compares visual fingerprints rather than raw file data, catching the kind of subtle variations — a slightly different crop, a recompressed JPEG — that traditional deduplication software misses.
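To make the "visual fingerprint" idea concrete, here is a minimal sketch of one well-known perceptual hashing scheme, the average hash. It is an illustration only, not the tooling any Boston institution is confirmed to use. It assumes each image has already been decoded into rows of grayscale brightness values (0 to 255) by an imaging library; the function names, the sample pixel grids, and the 5-bit threshold are all hypothetical choices made for this example.

```python
def average_hash(pixels, hash_size=8):
    """Return a (hash_size * hash_size)-bit fingerprint of a grayscale image.

    `pixels` is a list of rows of 0-255 brightness values. The image must be
    at least hash_size pixels wide and tall, or a block below would be empty.
    """
    height, width = len(pixels), len(pixels[0])
    cells = []
    for r in range(hash_size):
        y0, y1 = r * height // hash_size, (r + 1) * height // hash_size
        for c in range(hash_size):
            x0, x1 = c * width // hash_size, (c + 1) * width // hash_size
            # Shrink the image by averaging each block of pixels into one cell.
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        # One bit per cell: is it brighter than the image's average?
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a, b):
    """Count the bits that differ between two fingerprints."""
    return bin(a ^ b).count("1")


def is_near_duplicate(pixels_a, pixels_b, max_distance=5):
    """Treat two images as near-duplicates if few fingerprint bits differ.

    max_distance=5 is an assumed threshold, not a standard value.
    """
    distance = hamming_distance(average_hash(pixels_a), average_hash(pixels_b))
    return distance <= max_distance


# Hypothetical sample data: a left-to-right gradient, a lightly altered copy
# (standing in for recompression noise), a half-resolution copy, and a
# top-to-bottom gradient that is a genuinely different picture.
original = [[8 * x for x in range(32)] for y in range(32)]
recompressed = [
    [min(255, max(0, v + (7 * x + 3 * y) % 5 - 2)) for x, v in enumerate(row)]
    for y, row in enumerate(original)
]
half_size = [[16 * x for x in range(16)] for y in range(16)]
different = [[8 * y for x in range(32)] for y in range(32)]

print(is_near_duplicate(original, recompressed))
print(is_near_duplicate(original, half_size))
print(is_near_duplicate(original, different))
```

Because the fingerprint depends on coarse brightness patterns rather than raw bytes, a renamed, rescaled, or recompressed copy can still match. This simple scheme is much less tolerant of cropping, so catching cropped copies would likely need a more robust method than the one sketched here.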

At the municipal level, the city's Department of Innovation and Technology, part of Mayor Michelle Wu's administration and aligned with her broader push for operational transparency, has acknowledged the issue as part of a wider data governance review. City officials have not put a public dollar figure on the storage costs involved, but industry benchmarks suggest large municipal governments routinely spend between $200,000 and $600,000 annually on cloud and on-premises storage, with redundant files often accounting for 20 to 40 percent of total data volume. If storage costs scale with volume, those figures imply roughly $40,000 to $240,000 a year in avoidable expense.

The Boston Public Library system, which digitized more than 180,000 items from its Norman B. Leventhal Map & Education Center collection alone, has flagged duplication as a recurring byproduct of multi-phase scanning projects where the same item gets processed by different vendors or at different resolutions. Librarians working on the BPL's Digital Commonwealth portal — a shared platform used by institutions across Massachusetts — have pushed for metadata standards that would make duplicates easier to catch before they enter the archive rather than after.

What Comes Next for Boston's Institutions

The practical path forward, according to digital preservation professionals, involves three distinct phases: automated detection using hashing and machine-learning comparison tools, human review of flagged files to avoid deleting genuinely distinct images that merely look similar, and updated intake protocols that require deduplication checks before new batches are added to any live archive. Several Boston-area universities — including Emerson College on Boylston Street and UMass Boston on Columbia Point — have already begun piloting intake workflows that build the check into the upload process rather than treating it as a cleanup job.
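The intake-time check described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not any institution's actual workflow: it catches only byte-identical files using a SHA-256 digest, it sends suspected duplicates to a human review queue instead of deleting them, and the class name, method name, and file names are hypothetical.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class IntakeArchive:
    """Toy archive that checks each upload for duplicates before accepting it."""

    accepted: dict = field(default_factory=dict)      # digest -> first file name
    review_queue: list = field(default_factory=list)  # (new name, existing name)

    def ingest(self, name: str, data: bytes) -> bool:
        """Accept a new file, or queue it for human review if its bytes match
        a file already in the archive. Returns True when the file is accepted."""
        digest = hashlib.sha256(data).hexdigest()
        existing = self.accepted.get(digest)
        if existing is not None:
            # Nothing is deleted automatically; a person decides what happens.
            self.review_queue.append((name, existing))
            return False
        self.accepted[digest] = name
        return True


# Hypothetical usage: the second upload repeats the first file's bytes.
archive = IntakeArchive()
archive.ingest("scan_001.jpg", b"example image bytes")
archive.ingest("scan_001_copy.jpg", b"example image bytes")
archive.ingest("scan_002.jpg", b"other image bytes")
print(archive.review_queue)
```

A fuller pipeline would presumably add a near-duplicate comparison alongside the exact-match digest, since recompressed or rescaled copies have different bytes and would pass this check.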

For residents and researchers who rely on the city's public-facing digital archives, the immediate practical advice is straightforward: if a search on the City of Boston's open data portal or the BPL's Digital Commonwealth site returns multiple versions of what appears to be the same image, report it through the platform's feedback mechanism. Those flags feed directly into the review queues that archivists use to prioritize their deduplication work. The long-term fix is a governance one — consistent standards, enforced at the point of upload, that prevent the problem from growing faster than anyone can clean it up.

About this article

Published by The Daily Boston

This article was produced by The Daily Boston editorial desk and covers news in Boston. See our editorial standards for how we use AI.
