The Daily Boston


Boston's Digital Archives Are Riddled With Duplicate Images — And the Numbers Tell a Costly Story

From city hall records to university research databases, redundant image files are quietly draining storage budgets and slowing down the institutions that run Boston.

By Boston News Desk · Published 4 July 2026, 2:51 pm



Boston's public institutions are sitting on a problem measured in terabytes. Across municipal databases, university digital archives, and the Massachusetts Bay Transportation Authority's document management systems, duplicate image files — photographs, scanned records, design renderings — have accumulated for years without systematic removal. The financial and operational cost of that neglect is now becoming harder to ignore.

The issue surfaces at a moment when the city's technology infrastructure is under pressure. Mayor Michelle Wu's administration has pushed digitization of city services as part of a broader modernization agenda, moving permitting, inspections, and community-engagement records online. That migration has accelerated the volume of files entering city servers. According to digital asset management specialists, automated uploads also tend to multiply duplicate images at roughly three to four times the rate of manual filing systems, because they rarely include deduplication checks at the point of entry.

What the Data Actually Looks Like

Industry benchmarks published by the Storage Networking Industry Association put the share of redundant files in unmanaged enterprise archives at between 25 and 40 percent. Applied to a mid-sized municipal government like Boston's, that range suggests a substantial share of what the city spends annually on cloud and on-premises storage may be paying for copies of files already in the system. The city's Office of Innovation and Technology oversees digital infrastructure across dozens of departments housed in City Hall on Cambridge Street, but no completed citywide deduplication audit has been publicly reported.

The problem is not unique to government. Northeastern University's library system, which maintains digital collections at its Snell Library on Huntington Avenue, and the Boston Public Library's Digital Commonwealth project (a statewide repository of digitized historical materials) both operate archives where duplicate image ingestion is a documented challenge. Digital Commonwealth, which hosts more than 1.6 million items from Massachusetts cultural institutions, implemented automated similarity-detection tools starting in 2022, but administrators have acknowledged publicly that legacy collections uploaded before that year remain largely unaudited.

The MBTA faces a parallel issue in its engineering and infrastructure documentation. The authority holds decades of track diagrams, station photographs, and construction records digitized from paper originals. When the Green Line Extension opened its final segment, the Medford branch, in December 2022 (the Union Square branch in Somerville had opened that March), post-project documentation uploads were reported internally to have generated significant file redundancy, a common outcome when multiple contractors submit overlapping progress-photo sets to a shared repository.

The Cost of Doing Nothing

Storage is not free. Amazon S3 cloud storage, the type commonly used by Massachusetts state agencies under the state's MassIT procurement framework, runs at roughly $0.023 per gigabyte per month at standard rates as of mid-2026. For an archive holding 500 terabytes with a 30 percent duplication rate, the redundant 150 terabytes cost about $3,450 a month to store. Eliminating them would save roughly $41,000 a year on storage alone, before accounting for reduced backup times and faster search performance.
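The arithmetic behind that estimate can be checked with a few lines of Python. This is a minimal sketch that uses only the illustrative figures from the paragraph above; the function name and the choice of decimal units (1 terabyte = 1,000 gigabytes) are assumptions, and real cloud bills include tiered pricing and request fees that are ignored here.

```python
# Back-of-the-envelope cost of storing duplicate files.
# Inputs are the article's illustrative figures, not measured values.

GB_PER_TB = 1_000  # assumption: decimal units, 1 TB = 1,000 GB


def annual_duplicate_cost(archive_tb: float, duplicate_share: float,
                          usd_per_gb_month: float) -> float:
    """Return the yearly storage cost attributable to duplicate files."""
    duplicate_gb = archive_tb * GB_PER_TB * duplicate_share
    return duplicate_gb * usd_per_gb_month * 12


if __name__ == "__main__":
    # 500 TB archive, 30 percent duplicates, $0.023 per GB per month
    cost = annual_duplicate_cost(500, 0.30, 0.023)
    print(f"${cost:,.0f} per year")  # prints $41,400 per year
```

Changing the duplication share to the 25 or 40 percent ends of the benchmark range shows how sensitive the estimate is to that single assumption.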

The region's biotech corridors, along Longwood Avenue in Boston and Binney Street in Cambridge, generate their own version of the same problem. Research institutions, including those affiliated with Harvard Medical School and the Broad Institute, maintain imaging datasets such as microscopy photographs and clinical scan exports, where duplication rates in unmanaged systems have been measured at 20 to 35 percent in peer-reviewed data management studies published in journals like Scientific Data.

The practical remedies are well established. Perceptual hashing algorithms reduce each image to a compact fingerprint of its visual content, so visually identical or near-identical images can be matched even when file names differ. Tools built on that approach, including open-source options like imagededup, can process tens of thousands of files per hour on standard server hardware. The obstacle in most institutional settings is not technology. It is the organizational decision to schedule and fund an audit in the first place.
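To make the idea of perceptual hashing concrete, here is a minimal sketch of one of the simplest variants, an average hash, written in plain Python. It is an illustration under stated assumptions, not the method used by any tool or institution named in this article: images are represented as grids of grayscale values (decoding real files would need an image library), and the function names and the distance threshold of 5 bits are hypothetical choices.

```python
# Minimal average-hash sketch. An "image" here is a 2D list of grayscale
# values (0-255); reading actual image files is outside this sketch.

from typing import List

Gray = List[List[float]]


def shrink(pixels: Gray, size: int = 8) -> Gray:
    """Downsample to size x size by averaging the pixels in each cell."""
    height, width = len(pixels), len(pixels[0])
    if height < size or width < size:
        raise ValueError("image must be at least size x size pixels")
    grid = []
    for r in range(size):
        r0, r1 = r * height // size, (r + 1) * height // size
        row = []
        for c in range(size):
            c0, c1 = c * width // size, (c + 1) * width // size
            cell = [pixels[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            row.append(sum(cell) / len(cell))
        grid.append(row)
    return grid


def average_hash(pixels: Gray, size: int = 8) -> int:
    """Return a size*size-bit integer: one bit per cell, set if above the mean."""
    cells = [value for row in shrink(pixels, size) for value in row]
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | int(value > mean)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions where two hashes differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: Gray, b: Gray, max_distance: int = 5) -> bool:
    """Treat two images as duplicates if their hashes differ in few bits.

    The threshold of 5 is an illustrative assumption, not a standard value.
    """
    return hamming_distance(average_hash(a), average_hash(b)) <= max_distance


if __name__ == "__main__":
    # A 16x16 horizontal gradient, a 32x32 enlarged copy, and an inverted copy.
    original = [[x * 16 for x in range(16)] for _ in range(16)]
    enlarged = [[original[y // 2][x // 2] for x in range(32)] for y in range(32)]
    inverted = [[240 - value for value in row] for row in original]
    print(is_near_duplicate(original, enlarged))  # True
    print(is_near_duplicate(original, inverted))  # False
```

Because the hash depends on the overall pattern of light and dark rather than on the bytes of the file, the enlarged copy matches the original even though the two files would share no name, size, or checksum.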

For Boston's city departments, that decision sits with the Office of Innovation and Technology in coordination with each department's records manager. Institutions running their own archives — the BPL, Northeastern, the MBTA — will each need to build deduplication into standard ingestion workflows rather than treating it as a one-time cleanup project. The longer the audit is deferred, the larger the redundant archive grows, and the more expensive the eventual remediation becomes.
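What a deduplication check at the point of ingestion might look like can be sketched in a few lines. The example below is an assumption-laden illustration, not a description of any institution's system: the `IngestIndex` class and the file names are hypothetical, the index lives in memory rather than in a database, and a SHA-256 digest catches only byte-for-byte identical copies. Resized or re-encoded copies would need a perceptual comparison of the kind the article describes.

```python
# Sketch of an ingest-time check for exact duplicate files.
# IngestIndex is a hypothetical helper, not part of any named system.

import hashlib
from typing import Dict, Optional


class IngestIndex:
    """In-memory index of the content digests of files already stored."""

    def __init__(self) -> None:
        # Maps SHA-256 hex digest -> name of the first file with that content.
        self._seen: Dict[str, str] = {}

    def ingest(self, name: str, data: bytes) -> Optional[str]:
        """Register a file; return the name of an identical stored file, if any."""
        digest = hashlib.sha256(data).hexdigest()
        existing = self._seen.get(digest)
        if existing is None:
            self._seen[digest] = name  # first copy: record it
        return existing


if __name__ == "__main__":
    index = IngestIndex()
    photo = b"example image bytes"
    print(index.ingest("contractor_a/platform.jpg", photo))  # None
    print(index.ingest("contractor_b/IMG_0042.jpg", photo))  # contractor_a/platform.jpg
```

Running the check on every upload means a second contractor's copy of the same photograph is flagged when it arrives, instead of being discovered years later in a cleanup audit.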


About this article

Published by The Daily Boston

This article was produced by The Daily Boston editorial desk and covers news in Boston. See our editorial standards for how we use AI.
