The Daily Boston

Boston news, every day

Boston's Duplicate Image Problem: The Numbers Driving a City-Wide Digital Cleanup

Municipal archives, university libraries, and biotech firms across Boston are sitting on millions of redundant digital files — and the cost of ignoring them is rising fast.

By Boston News Desk · Published 4 July 2026, 3:12 pm

Photo: Jack Sherman on Pexels

Boston's public and private institutions collectively store an estimated 40 percent more digital image data than they actually use, according to industry analyses of municipal and academic archive systems — and the bloat is getting expensive. From the City of Boston's Office of Digital Innovation to the Northeastern University library system on Huntington Avenue, administrators are confronting the same unglamorous problem: duplicate images buried inside sprawling digital repositories are consuming server space, distorting search results, and quietly draining IT budgets.

The timing matters. Mayor Michelle Wu's administration has pushed hard on open-data transparency since 2022, requiring city departments to publish records online. That mandate accelerated file uploads across agencies — but without standardized deduplication protocols, the Boston Planning Department and the MBTA's public-facing media portals both ended up with redundant image libraries that complicate everything from press releases to legal records requests. Combine that pressure with a biotech corridor on Binney Street in Cambridge — where research imaging generates petabytes of data annually — and the scale of the duplicate problem snaps into focus.

What the Data Actually Shows

Storage is not free. Enterprise-grade cloud storage for institutions runs between $0.02 and $0.05 per gigabyte per month, depending on retrieval tiers, according to published pricing from major cloud providers as of mid-2026. Take a mid-sized city department sitting on 200 terabytes of image archives, a realistic figure for an agency handling permit photographs, inspection records, and communications assets. At an illustrative 30 percent redundancy rate, roughly 60 terabytes, or 60,000 gigabytes, of that storage is taken up by duplicates. At $0.03 per gigabyte per month, that is $1,800 a month, or $21,600 a year, spent on files that serve no operational purpose.
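
The arithmetic in that example can be reproduced with a few lines of Python. This is only a sketch: the function name is hypothetical, and it assumes decimal units (1 terabyte = 1,000 gigabytes) and a flat per-gigabyte price with no retrieval or egress fees.

```python
# Assumption: decimal storage units, as cloud price lists typically use.
GB_PER_TB = 1_000


def wasted_storage_cost(archive_tb, redundancy_rate, price_per_gb_month):
    """Return (wasted_tb, monthly_cost, annual_cost) for duplicated data."""
    wasted_tb = archive_tb * redundancy_rate
    monthly = wasted_tb * GB_PER_TB * price_per_gb_month
    return wasted_tb, monthly, monthly * 12


# Figures from the article: 200 TB archive, 30 percent redundancy.
# The three prices span the quoted $0.02 to $0.05 range.
for price in (0.02, 0.03, 0.05):
    wasted_tb, monthly, annual = wasted_storage_cost(200, 0.30, price)
    print(f"${price:.2f}/GB: {wasted_tb:.0f} TB wasted, "
          f"${monthly:,.0f} a month, ${annual:,.0f} a year")
```

Running the loop across the quoted price range shows how sensitive the annual figure is to the storage tier an institution has chosen.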

The Massachusetts Institute of Technology's library digitization program, which has been cataloguing physical collections from the Barker Engineering Library since 2023, flagged duplicate-image rates as high as 22 percent in its first batch of scanned documents. Library science researchers say that figure is consistent with large-scale digitization projects that lack automated hash-matching at the point of ingestion. Hash-matching computes a digital fingerprint from each file's contents and flags any incoming file whose fingerprint matches one already stored, before the copy is saved. It costs relatively little to implement but is still absent from many institutional workflows.
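
To make the idea concrete, here is a minimal sketch of hash-matching at ingestion using SHA-256 from Python's standard library. The `IngestIndex` class, its method names, and the file names are hypothetical illustrations, not a description of any system MIT or another institution actually runs.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """SHA-256 digest of the file's bytes; identical bytes give identical digests."""
    return hashlib.sha256(data).hexdigest()


class IngestIndex:
    """Hypothetical in-memory record of content already accepted."""

    def __init__(self):
        self._seen = {}  # digest -> name of the first file with that content

    def check(self, name: str, data: bytes):
        """Return the name of an earlier identical file, or None if the content is new."""
        digest = fingerprint(data)
        original = self._seen.get(digest)
        if original is None:
            self._seen[digest] = name  # first time this content is seen
        return original


index = IngestIndex()
print(index.check("scan_001.tif", b"example image bytes"))       # new content
print(index.check("scan_001_copy.tif", b"example image bytes"))  # flagged as a copy
```

A check of this kind only catches byte-for-byte copies. An image that has been resized or re-saved produces a different digest and passes through unflagged.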

The MBTA's communications archive, which houses decades of route maps, station photography, and promotional images, reportedly underwent an internal audit in late 2025 as part of the agency's broader IT modernization effort tied to the Federal Transit Administration's capital investment conditions. The audit's public summary did not disclose specific duplicate rates, but the agency confirmed it was migrating assets to a consolidated digital asset management system.

Where Boston Institutions Are Acting

Several local organizations are already moving. The Boston Public Library's Digital Commonwealth program, headquartered on Boylston Street in Copley Square, has integrated perceptual hashing tools into its ingest pipeline for the Massachusetts Collections Online portal. The technique catches not just exact duplicates but near-identical images — a scanned photograph that was uploaded twice at slightly different resolutions, for example, or a JPEG that was re-saved with minor compression changes.
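
The article does not say which perceptual hashing algorithm the library uses. As an illustration only, the sketch below implements average hashing, one of the simplest variants, in plain Python. It assumes the image has already been decoded to a grid of grayscale values whose height and width are multiples of eight. The function names and the distance cutoff of 5 are hypothetical choices for the example.

```python
def average_hash(pixels, size=8):
    """Average hash of a grayscale image given as rows of 0-255 values.

    The image is shrunk to size x size by block averaging. Each cell then
    becomes a 1 bit if it is brighter than the mean of all cells.
    """
    height, width = len(pixels), len(pixels[0])
    block_h, block_w = height // size, width // size
    cells = []
    for r in range(size):
        for c in range(size):
            block = [pixels[y][x]
                     for y in range(r * block_h, (r + 1) * block_h)
                     for x in range(c * block_w, (c + 1) * block_w)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming(a, b):
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


# Synthetic test images: the same left-to-right gradient at two resolutions,
# and a different picture (a top-to-bottom gradient).
scan_large = [[x * 8 for x in range(32)] for _ in range(32)]
scan_small = [[x * 16 for x in range(16)] for _ in range(16)]
other = [[y * 8 for _ in range(32)] for y in range(32)]

NEAR_DUPLICATE_MAX = 5  # hypothetical cutoff, out of 64 bits

for label, image in (("resized copy", scan_small), ("different image", other)):
    distance = hamming(average_hash(scan_large), average_hash(image))
    verdict = "near-duplicate" if distance <= NEAR_DUPLICATE_MAX else "distinct"
    print(f"{label}: distance {distance}, {verdict}")
```

Unlike a cryptographic digest, which changes completely when a single byte changes, this kind of hash changes only a little when the picture changes a little, so a small Hamming distance between two hashes suggests the same underlying image.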

In Dorchester, the Dudley Street Neighborhood Initiative has been digitizing decades of community planning documents and photographs as part of a 2025 grant from the Institute of Museum and Library Services. Project staff there have adopted open-source deduplication software — tools like digiKam and dupeGuru — rather than commercial platforms, keeping costs manageable on a nonprofit budget.

For any organization wrestling with this now, the practical path forward has three steps. First, run a baseline audit using hash-comparison tools before migrating anything to a new system; moving duplicates means paying for them twice. Second, establish a single ingest point with automated flagging built in, so the problem does not rebuild itself. Third, set a retention policy with defined review dates. A photograph of a Jamaica Plain streetscape from 2018 may be historically valuable, but seventeen identical copies of it are not. The cleanup is unglamorous work, but with avoidable storage costs of roughly $21,600 a year for a department the size of the one in the example above, Boston's institutions cannot afford to keep pretending the problem will solve itself.
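
The first of those steps, the baseline audit, can be sketched with the Python standard library alone. The code below is an illustration under stated assumptions rather than a replacement for the dedicated tools named earlier: the `audit` function is hypothetical, it finds only byte-identical files, and the demonstration runs against a throwaway temporary folder instead of a real archive.

```python
import hashlib
import os
import tempfile
from collections import defaultdict


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large images are never loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit(root):
    """Group files under root by content; return (duplicate_groups, reclaimable_bytes)."""
    by_hash = defaultdict(list)
    for folder, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            by_hash[sha256_of_file(path)].append(path)
    groups = {h: sorted(paths) for h, paths in by_hash.items() if len(paths) > 1}
    # Keeping one copy per group, every additional copy is reclaimable space.
    reclaimable = sum(os.path.getsize(paths[0]) * (len(paths) - 1)
                      for paths in groups.values())
    return groups, reclaimable


# Demonstration on a temporary folder with two identical files and one unique file.
with tempfile.TemporaryDirectory() as root:
    samples = [("a.jpg", b"streetscape"), ("b.jpg", b"streetscape"), ("c.jpg", b"other")]
    for name, data in samples:
        with open(os.path.join(root, name), "wb") as handle:
            handle.write(data)
    groups, reclaimable = audit(root)
    print(len(groups), "duplicate group(s),", reclaimable, "bytes reclaimable")
```

The report gives a before-migration estimate of reclaimable space. Deciding which copy to keep, and for how long, remains a matter for the ingest and retention policies in steps two and three.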

About this article

Published by The Daily Boston

This article was produced by The Daily Boston editorial desk and covers news in Boston. See our editorial standards for how we use AI.
