The Daily Boston

Boston's Digital Archives Are Drowning in Duplicate Images — and the Numbers Tell a Costly Story

From city hall records to university photo libraries, redundant digital files are eating storage budgets and slowing down the institutions that run Boston.

By Boston News Desk · Published 4 July 2026, 2:48 pm

Photo: Richard Lathrop on Pexels

Boston's public institutions collectively store tens of millions of digital image files across municipal servers, university archives, and MBTA infrastructure databases — and a growing share of those files are exact or near-exact duplicates that cost money, slow systems, and complicate records management. The problem is not abstract. It has a price tag.

Digital asset management consultants who work with New England universities and municipal governments estimate that duplicate image files typically consume between 20 and 40 percent of an organization's total image storage capacity. For a mid-size city agency running a 50-terabyte document archive, that translates to anywhere from 10 to 20 terabytes of redundant data — storage that must be licensed, backed up, and maintained on an ongoing basis.

The timing matters because Boston is in the middle of an aggressive digitization push. Mayor Michelle Wu's administration has prioritized open-data access and digital transparency across city departments, while institutions like Massachusetts General Hospital on Fruit Street and Northeastern University on Huntington Avenue have expanded their digital imaging infrastructure significantly since 2022. More images entering systems means more duplicates accumulating — and more budget pressure on the IT teams managing them.

Where the Clutter Lives

The MBTA's infrastructure documentation library is one concrete example of where duplicate-image sprawl creates operational friction. Engineers photographing track conditions, signal equipment, and station facilities along the Orange Line corridor — from Forest Hills in Jamaica Plain through downtown — routinely upload images from multiple devices and field teams. Without automated deduplication tools running at ingestion, the same pothole or cracked tile can exist as four or five separate files across different project folders, each tagged differently and none of them flagged as redundant.

The Boston City Archives, located on Rivermoor Street in West Roxbury, faces a parallel challenge with historical photograph collections that have been scanned multiple times as technology improved. A single glass-plate negative from the 1910s might exist as a 300 dpi scan, a 600 dpi rescan, a JPEG derivative, and a web-optimized thumbnail: four files, one image, and no automated system to link them as relatives rather than strangers. Multiply that pattern across 150 years of civic photography and the storage math becomes uncomfortable quickly.

Commercial cloud storage currently runs between $0.02 and $0.023 per gigabyte per month for enterprise accounts on major platforms. An institution sitting on 15 terabytes of duplicate image data is therefore spending roughly $300 to $345 every month, or about $3,600 to $4,140 a year, to store files it does not need.
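The arithmetic behind those figures can be checked with a short calculation. The sketch below is illustrative only: it assumes decimal units (1 terabyte = 1,000 gigabytes) and a flat per-gigabyte rate, and the function name `duplicate_storage_cost` is hypothetical rather than part of any billing tool.

```python
# Rough storage-cost arithmetic for redundant image data.
# Assumption: decimal units (1 TB = 1,000 GB) and a flat monthly rate.
GB_PER_TB = 1_000


def duplicate_storage_cost(duplicate_tb, price_per_gb_month):
    """Return (monthly, annual) cost in dollars of storing duplicate data."""
    monthly = duplicate_tb * GB_PER_TB * price_per_gb_month
    return monthly, monthly * 12


# The article's example: 15 TB of duplicates at the low and high quoted rates.
for price in (0.02, 0.023):
    monthly, annual = duplicate_storage_cost(15, price)
    print(f"${price}/GB: ${monthly:,.0f} per month, ${annual:,.0f} per year")
```

Under these assumptions the monthly bill lands between $300 and $345, and the annual bill between $3,600 and $4,140. Real invoices would also reflect backup copies, retrieval fees, and tiered pricing, none of which are modeled here.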

What Deduplication Actually Costs to Fix

The remediation side has its own numbers. Perceptual hashing tools, software that identifies visually similar images even when file names and metadata differ, are available at the enterprise level for licensing fees that typically start around $8,000 annually for mid-size deployments. Open-source alternatives such as pHash and the Python ImageHash library exist but require dedicated IT staff hours to implement and maintain. For a city agency without a dedicated digital asset management team, the realistic cost of a full deduplication audit and cleanup project runs between $25,000 and $60,000 when contractor hours are included.
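To make the idea of perceptual hashing concrete, here is a minimal sketch of an "average hash," one of the simplest techniques in that family. It is not how any particular commercial product works. It assumes the image has already been decoded into a grid of grayscale values whose height and width are multiples of 8, and the function names are hypothetical.

```python
def average_hash(pixels, size=8):
    """Compute a simple average hash from a 2D list of grayscale values (0-255).

    Assumes the image height and width are multiples of `size`.
    """
    rows, cols = len(pixels), len(pixels[0])
    block_h, block_w = rows // size, cols // size
    # Downscale by averaging each block of pixels into one cell.
    cells = []
    for r in range(size):
        for c in range(size):
            block = [pixels[y][x]
                     for y in range(r * block_h, (r + 1) * block_h)
                     for x in range(c * block_w, (c + 1) * block_w)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    # One bit per cell: 1 if the cell is brighter than the overall mean.
    bits = 0
    for value in cells:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a, b):
    """Count the bits that differ between two hashes."""
    return bin(a ^ b).count("1")


# A synthetic 16x16 gradient, a slightly brightened copy, and an inverted copy.
original = [[x * 16 for x in range(16)] for _ in range(16)]
brighter = [[min(255, v + 5) for v in row] for row in original]
inverted = [[255 - v for v in row] for row in original]

print(hamming_distance(average_hash(original), average_hash(brighter)))
print(hamming_distance(average_hash(original), average_hash(inverted)))
```

A small distance suggests the two images look alike even though their bytes differ, while a large distance suggests they are unrelated. Where to set the cutoff is a judgment call that depends on the collection, and more robust algorithms are generally used for rescans and crops.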

Harvard University's Weissman Preservation Center in Cambridge, which advises cultural institutions across New England on digital stewardship, has published guidance recommending that organizations audit image collections for redundancy at least every 18 months. Most municipal agencies in Massachusetts do not meet that standard, according to state digital records guidelines updated in March 2025.

Greater Boston's biotech corridor along Binney Street in Cambridge adds another layer. Pharmaceutical and research companies store enormous volumes of microscopy and clinical imaging data, and regulatory compliance requirements under FDA 21 CFR Part 11 mean those organizations must retain certain records. The rules do not, however, require retaining duplicate copies of the same image. Companies that have not built deduplication into their imaging pipelines are paying compliance storage costs for files that provide no additional evidentiary value.

For any Boston institution looking to act before the next budget cycle, the starting point is an ingestion audit — reviewing where images enter the system, how many upload pathways exist, and whether checksums are generated at upload to catch exact duplicates instantly. That step costs almost nothing and can surface the scale of the problem within weeks.
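The checksum step described above can be sketched in a few lines. This is an illustration under stated assumptions, not a description of any agency's system: the `IngestionIndex` class and the file names are hypothetical, and the index lives in memory rather than in a database. Only the SHA-256 call from Python's standard `hashlib` module is a real API.

```python
import hashlib


class IngestionIndex:
    """Tracks SHA-256 checksums of uploaded files to flag exact duplicates."""

    def __init__(self):
        # Maps checksum -> name of the first file stored with that content.
        self._seen = {}

    def register(self, name, data):
        """Return the name of an earlier identical file, or None if new."""
        digest = hashlib.sha256(data).hexdigest()
        if digest in self._seen:
            return self._seen[digest]
        self._seen[digest] = name
        return None


# Hypothetical uploads: the second file has the same bytes as the first.
index = IngestionIndex()
print(index.register("orange_line/tile_001.jpg", b"fake image bytes"))
print(index.register("station_audit/IMG_4410.jpg", b"fake image bytes"))
print(index.register("station_audit/IMG_4411.jpg", b"different bytes"))
```

A checksum only matches files that are byte-for-byte identical. A rescan, a resized copy, or a re-saved JPEG of the same photograph produces a different checksum, which is the gap perceptual hashing is meant to cover.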

About this article

Published by The Daily Boston

This article was produced by The Daily Boston editorial desk and covers news in Boston. See our editorial standards for how we use AI.
