The Daily Boston

Boston's Digital Archives Are Riddled With Duplicate Images — and the Numbers Are Staggering

A quiet data crisis is costing Boston's public institutions real money and storage capacity, and the scale of the problem is only now coming into focus.

By Boston News Desk · Published 4 July 2026, 2:44 pm

Boston's universities, city agencies, and cultural institutions are sitting on a growing mountain of redundant digital image files — duplicate photos, scanned documents stored twice or three times over, and backup copies that were never purged — and the combined storage drain runs into the terabytes across publicly funded systems. The problem, long dismissed as a housekeeping nuisance, has attracted renewed scrutiny from IT administrators and archivists who say the financial and operational costs are no longer trivial.

The timing matters. The Wu administration's push to digitize city services and expand open-data infrastructure — part of its broader technology agenda tied to the Boston Digital Equity Initiative — means more images, more documents, and more institutional data flowing into city-managed servers than at any point in Boston's history. Adding duplicate images into that pipeline multiplies the waste.

What the Data Actually Shows

Across large research universities, industry analysts estimate that between 20 and 30 percent of stored image files are redundant: exact or near-exact duplicates that consume storage without adding informational value. Applied to an institution the size of Boston University, which operates one of the largest private university networks in New England, with campuses stretching along Commonwealth Avenue from Kenmore Square to Packard's Corner, that share represents a substantial and measurable drain. Storage costs for enterprise-grade systems currently run roughly $50 to $150 per terabyte per month, depending on redundancy tier and on whether the system is cloud-based or on-premises. Those figures are drawn from published AWS and Microsoft Azure pricing as of mid-2026.
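To put those percentages and prices into dollar terms, the arithmetic can be sketched in a few lines of Python. The 500-terabyte archive size below is a purely hypothetical figure chosen for illustration, not a number reported by Boston University or any other institution; the redundancy share and per-terabyte prices are the ranges quoted above.

```python
def monthly_waste(total_tb, redundant_pct, cost_per_tb_month):
    """Dollars per month spent storing redundant files."""
    return total_tb * redundant_pct / 100 * cost_per_tb_month

# Hypothetical archive size, used only to illustrate the calculation.
ARCHIVE_TB = 500

# Best case: 20 percent redundant at $50 per terabyte per month.
low = monthly_waste(ARCHIVE_TB, 20, 50)
# Worst case: 30 percent redundant at $150 per terabyte per month.
high = monthly_waste(ARCHIVE_TB, 30, 150)

print(f"${low:,.0f} to ${high:,.0f} per month")
```

Under those assumptions the redundant share alone would cost between $5,000 and $22,500 a month. A real estimate would need an institution's actual storage footprint and contract pricing.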

The Boston Public Library's Digital Repository, headquartered at the central branch on Boylston Street in Copley Square, has been working since 2023 to audit its holdings under a grant-funded digitization project. The library's publicly stated collection includes hundreds of thousands of image files drawn from historical photograph archives, and archivists there have acknowledged in public presentations that deduplication is an ongoing challenge rather than a solved problem. The BPL did not respond to a request for comment by press time.

At the neighborhood level, the problem shows up in city planning workflows. The Boston Planning Department, which manages image-heavy zoning files and site documentation for major development corridors including Washington Street in Dorchester and the ongoing Jamaica Plain-Roxbury Connectivity project, generates thousands of new image assets annually. Without automated deduplication protocols baked into document management systems, those files accumulate in ways that are difficult to retroactively audit.

Why Deduplication Is Harder Than It Sounds

Removing duplicate images is not as simple as running a search. Images scanned at different resolutions, cropped differently, or saved in different file formats (JPEG versus TIFF versus PNG) may be functionally identical but won't register as duplicates under basic hash matching, which compares files byte for byte. Perceptual hashing tools, which compare images by visual similarity instead, are more effective but carry computational overhead and take staff time to implement correctly.
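The difference between the two approaches can be illustrated with a minimal Python sketch. It uses "average hash," one simple perceptual hashing scheme, and is not the method of any institution or tool named in this article. To stay self-contained it works on tiny hand-written grayscale grids (`original`, `rescanned`, `different`, all invented for illustration) rather than real image files; a production tool would first decode each image and shrink it to a small grayscale thumbnail.

```python
import hashlib

def exact_hash(data: bytes) -> str:
    """Byte-for-byte fingerprint: any change in the file changes the hash."""
    return hashlib.sha256(data).hexdigest()

def average_hash(pixels: list[list[int]]) -> int:
    """Perceptual fingerprint: one bit per pixel, set if brighter than the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits

def hamming(a: int, b: int) -> int:
    """Number of differing bits; small values mean visually similar images."""
    return bin(a ^ b).count("1")

def to_bytes(pixels: list[list[int]]) -> bytes:
    return bytes(p for row in pixels for p in row)

# Invented 4x4 grayscale grids (0 = black, 255 = white).
original = [[10, 200, 10, 200], [200, 10, 200, 10],
            [10, 200, 10, 200], [200, 10, 200, 10]]
# Same picture with slight brightness noise, as a rescan might produce.
rescanned = [[12, 198, 9, 203], [197, 11, 202, 8],
             [13, 201, 10, 199], [199, 12, 204, 9]]
# A genuinely different picture.
different = [[10, 10, 200, 200], [10, 10, 200, 200],
             [200, 200, 10, 10], [200, 200, 10, 10]]

print(exact_hash(to_bytes(original)) == exact_hash(to_bytes(rescanned)))
print(hamming(average_hash(original), average_hash(rescanned)))
print(hamming(average_hash(original), average_hash(different)))
```

The exact hashes of the original and the rescan do not match, so a byte-level check would miss the duplicate. Their average hashes differ by zero bits, while the unrelated image differs by eight of sixteen. In practice a threshold on that bit distance decides what counts as a near-duplicate, and choosing it is part of the staff time the approach requires.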

The Massachusetts Institute of Technology, whose sprawling Kendall Square campus in Cambridge sits just across the Charles River and whose researchers frequently collaborate with Boston-based institutions, has published open-source tools for large-scale image deduplication through its Computer Science and Artificial Intelligence Laboratory. Those tools are publicly available but require technical capacity to deploy — a barrier for smaller city agencies or nonprofits that lack dedicated IT staff.

For institutions looking to act, archivists and data managers recommend starting with a storage audit benchmarked against total file counts, not just raw gigabytes. Identifying the highest-volume image ingestion points — document scanners, grant-funded digitization projects, event photography workflows — is a faster path to measurable reduction than trying to audit entire legacy archives at once. Cloud storage providers including Google and Microsoft now offer built-in deduplication reporting tools within their enterprise tiers, and the City of Boston's existing contract with cloud infrastructure vendors could, in principle, be leveraged to begin that work without new procurement. The question is whether anyone in a position to act treats it as a priority before the next wave of digitization adds another layer to the pile.
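A file-count audit of the kind archivists describe can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a description of any city or vendor tool: it treats each top-level folder under an archive root as one ingestion point, it catches only exact byte-for-byte duplicates, and the function names and the list of image extensions are hypothetical choices.

```python
import hashlib
import os
from collections import defaultdict

# Assumed set of image extensions; adjust for the archive being audited.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}

def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large scans fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def audit(root):
    """Count image files and exact duplicates per top-level folder of root."""
    seen = set()
    report = defaultdict(lambda: {"files": 0, "duplicates": 0, "wasted_bytes": 0})
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit folders in a stable, alphabetical order
        for name in sorted(filenames):
            if os.path.splitext(name)[1].lower() not in IMAGE_EXTS:
                continue
            path = os.path.join(dirpath, name)
            parts = os.path.relpath(path, root).split(os.sep)
            # Files directly under root are grouped under ".".
            point = parts[0] if len(parts) > 1 else "."
            stats = report[point]
            stats["files"] += 1
            digest = file_digest(path)
            if digest in seen:
                # The first copy encountered counts as the original.
                stats["duplicates"] += 1
                stats["wasted_bytes"] += os.path.getsize(path)
            else:
                seen.add(digest)
    return dict(report)
```

Calling `audit` on an archive root returns one row per folder with its file count, duplicate count, and the bytes those duplicates occupy, which makes it easy to see which ingestion point contributes the most redundant files. Because folders are visited alphabetically, which copy is labeled the original is arbitrary; the totals are what matter.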

About this article

Published by The Daily Boston

This article was produced by The Daily Boston editorial desk and covers news in Boston. See our editorial standards for how we use AI.
