The Daily Boston

Boston's Digital Archives Are Riddled With Duplicate Images — Here's What the Numbers Reveal

A quiet data-quality crisis is costing city institutions storage dollars and researcher hours, and the problem is bigger than most administrators want to admit.

By Boston News Desk · Published 4 July 2026, 3:00 pm

Photo: Harrison Haines on Pexels

Boston's public institutions are sitting on a surprisingly large and expensive mess. Duplicate image files (identical or near-identical photographs, scans, and graphics stored multiple times across separate servers) have accumulated inside municipal databases, university digital libraries, and nonprofit archives at rates that specialists say are well above national benchmarks for comparably sized cities. The problem isn't new, but a push toward unified digital infrastructure across city departments in 2025 and 2026 has put hard numbers on what had been mostly anecdotal.

The timing matters for reasons that go beyond IT housekeeping. Mayor Michelle Wu's administration has been consolidating city data systems as part of a broader open-government initiative, and that consolidation is forcing a reckoning with redundant files that were never caught during earlier digitization drives. When libraries, planning offices, and housing agencies all uploaded the same streetscape photographs during separate projects over the past decade, no single department was tracking the overlap. Now someone has to.

The Scale of the Problem, by the Numbers

Digital archivists working with the Boston Public Library's Leventhal Map and Education Center on Boylston Street have identified duplication rates that can run as high as 30 percent in collections that were digitized in multiple phases, meaning nearly one in three image files in some batches is a copy of something already in the system. Storage costs for large image files, particularly high-resolution TIFFs used in archival work, typically run between $0.02 and $0.05 per gigabyte per month on cloud platforms, a figure that adds up quickly when a single institution holds tens of thousands of redundant files. The Leventhal Center alone manages more than 10,000 digitized map sheets, and that collection has gone through at least three separate ingest processes since 2015.
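
To make the storage arithmetic concrete, here is a rough back-of-the-envelope sketch in Python. The file count, the 30 percent rate, and the per-gigabyte prices come from the figures above; the average file size of 0.5 GB is purely an assumption for illustration, since no file sizes are reported, and applying the upper-bound duplication rate to a 10,000-file collection is likewise hypothetical.

```python
def monthly_duplicate_cost(total_files, duplicate_rate, avg_file_gb, price_per_gb_month):
    """Estimate the monthly cloud storage bill attributable to duplicate files."""
    duplicate_files = total_files * duplicate_rate
    return duplicate_files * avg_file_gb * price_per_gb_month


TOTAL_FILES = 10_000      # collection size cited in the article
DUPLICATE_RATE = 0.30     # upper-bound duplication rate cited in the article
AVG_FILE_GB = 0.5         # ASSUMPTION: average archival TIFF size, not from the article

low = monthly_duplicate_cost(TOTAL_FILES, DUPLICATE_RATE, AVG_FILE_GB, 0.02)
high = monthly_duplicate_cost(TOTAL_FILES, DUPLICATE_RATE, AVG_FILE_GB, 0.05)

print(f"Estimated duplicate storage cost: ${low:.2f} to ${high:.2f} per month")
```

Under those assumptions the duplicates account for roughly $30 to $75 a month. The estimate scales linearly, so a repository ten times larger, or one holding files ten times heavier, would see ten times the bill.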

Northeastern University's library system on Huntington Avenue faces a structurally similar challenge. The university's digital repository has absorbed collections from multiple academic departments and community partners, and staff have described the deduplication process as labor-intensive — a researcher must often open files side by side to confirm they are true duplicates rather than slightly different crops or resolution variants. Manual review at scale is not cheap: archival labor in Boston runs roughly $25 to $40 an hour for qualified metadata specialists, and a mid-sized collection cleanup can take hundreds of staff hours before automated tools can be deployed reliably.
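
The labor side can be estimated the same way. The sketch below uses the $25 to $40 hourly range quoted above; the 200-hour figure is a hypothetical stand-in for "hundreds of staff hours," not a number reported by any institution.

```python
def labor_cost_range(hours, low_rate=25.0, high_rate=40.0):
    """Return (low, high) manual-review cost in dollars for a given number of staff hours."""
    return hours * low_rate, hours * high_rate


HYPOTHETICAL_HOURS = 200  # ASSUMPTION: illustrative cleanup effort, not from the article

low_cost, high_cost = labor_cost_range(HYPOTHETICAL_HOURS)
print(f"Manual review: ${low_cost:,.0f} to ${high_cost:,.0f}")
```

At 200 hours, the range works out to $5,000 to $8,000 for a single collection.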

Automated deduplication software has improved substantially, with tools now capable of flagging perceptual duplicates — images that look identical to a human eye but differ in file size or compression — with accuracy rates above 95 percent in controlled tests. The catch is that those tools require clean, consistent metadata to work efficiently, and that is precisely what legacy Boston collections often lack. Images ingested from the city's neighborhood planning processes in Jamaica Plain and Dorchester in the early 2010s, for example, were frequently uploaded without standardized file-naming conventions, leaving automated systems struggling to match records that a librarian could identify as the same photograph in under ten seconds.
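
For readers curious how a perceptual match differs from an exact one, the following is a minimal sketch of one common approach, an "average hash," written in plain Python. It is not the method used by any tool or institution named in this story. The tiny 4x4 grayscale grids, the function names, and the distance threshold are all hypothetical; real systems would first shrink each image to a small grayscale thumbnail with an imaging library.

```python
import hashlib


def average_hash(pixels):
    """Bit string with 1 where a pixel is brighter than the image's mean brightness."""
    flat = [v for row in pixels for v in row]
    mean = sum(flat) / len(flat)
    return "".join("1" if v > mean else "0" for v in flat)


def hamming(a, b):
    """Count the positions at which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


def exact_digest(pixels):
    """SHA-256 of the raw pixel bytes; any change at all yields a different digest."""
    return hashlib.sha256(bytes(v for row in pixels for v in row)).hexdigest()


def is_perceptual_duplicate(a, b, max_distance=2):
    """ASSUMPTION: a threshold of 2 differing bits is illustrative, not a standard."""
    return hamming(average_hash(a), average_hash(b)) <= max_distance


# Hypothetical 4x4 grayscale thumbnails (brightness values 0-255).
original = [
    [10, 20, 200, 210],
    [15, 25, 205, 215],
    [12, 22, 198, 208],
    [11, 21, 202, 212],
]
# The same picture after re-saving: every value shifted slightly.
recompressed = [[min(255, v + 3) for v in row] for row in original]
# A different picture: bright and dark halves swapped.
different = [
    [200, 210, 10, 20],
    [205, 215, 15, 25],
    [198, 208, 12, 22],
    [202, 212, 11, 21],
]

print(exact_digest(original) == exact_digest(recompressed))  # byte-level match?
print(is_perceptual_duplicate(original, recompressed))       # visual match?
print(is_perceptual_duplicate(original, different))
```

The re-saved copy fails the exact byte comparison but passes the perceptual one, while the genuinely different image fails both. Crops and heavily edited variants can land near the threshold, which is where human review still comes in.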

What Comes Next for City Collections

The Wu administration's Digital Equity and Data Governance framework, which city departments began operating under in fiscal year 2026, includes a directive for agencies to audit file storage and eliminate redundancies before migrating to the new consolidated cloud environment. Departments have until the end of calendar year 2026 to complete initial audits. For large cultural institutions that operate independently of city government, there is no mandatory deadline, but the Boston Public Library has indicated it plans to align its own deduplication timeline with the city's schedule.

For researchers and community members who rely on these archives — genealogists pulling records from the West End Museum, planners referencing historical streetscape images for Jamaica Plain rezoning hearings — the practical effect of a successful cleanup would be faster search results and fewer dead-end duplicates clogging results pages. The less visible benefit is financial: storage budgets freed from redundant files can be redirected toward digitizing materials that have never been online at all. That is the argument archivists are making to administrators right now, and the numbers, for once, are squarely on their side.

About this article

Published by The Daily Boston

This article was produced by The Daily Boston editorial desk and covers news in Boston. See our editorial standards for how we use AI.
