The Daily Boston

Boston's Digital Archives Are Riddled With Duplicate Images — Here's What the Numbers Actually Show

A growing body of data reveals how redundant visual files are quietly draining storage budgets and slowing down city and university systems across Greater Boston.

By Boston News Desk · Published 4 July 2026, 2:40 pm

At least one in five digital image files stored across Boston's major public and academic institutions is an exact or near-exact duplicate, according to an analysis of digital asset management audits conducted by library technology consultants working with several of the city's universities. That figure — 20 percent or higher redundancy in large image repositories — has become a quiet driver of unnecessary IT spending at a moment when every dollar in municipal and university budgets is under scrutiny.

The timing matters. Mayor Michelle Wu's administration has pushed hard on digital equity and government transparency, which means city departments are migrating more records online faster than ever. The Boston Public Library's Copley Square branch alone digitized more than 400,000 archival images between 2022 and 2025 as part of its ongoing Digital Commonwealth partnership with the Massachusetts Board of Library Commissioners. Rapid digitization without duplicate-detection protocols is where the problem compounds.

The Scale of the Problem in Greater Boston

Northeastern University's library systems team flagged the duplicate image problem internally during a 2024 infrastructure review of its Snell Library digital collections on Huntington Avenue. Consultants found that redundant image files were consuming roughly 18 terabytes more storage than the unique-image count would justify. At current cloud storage pricing of around $23 per terabyte per month for enterprise-grade archival tiers, that works out to about $414 a month, or close to $5,000 a year, in avoidable cost for a single institution's collection.
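The arithmetic behind that estimate is easy to check. The short sketch below uses only the two figures quoted in this story (18 terabytes of excess storage and roughly $23 per terabyte per month); the variable names are illustrative, and the yearly total assumes both numbers hold steady for 12 months.

```python
# Figures quoted in the article; variable names are illustrative only.
excess_terabytes = 18        # storage attributed to redundant image files
price_per_tb_month = 23.00   # approximate archival-tier price in USD

monthly_cost = excess_terabytes * price_per_tb_month
# Assumption: price and excess volume stay flat for a full year.
annual_cost = monthly_cost * 12

print(f"Avoidable cost per month: ${monthly_cost:,.2f}")
print(f"Avoidable cost per year:  ${annual_cost:,.2f}")
```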

Scale that across the Fenway and Mission Hill corridor, where Wentworth Institute of Technology, Massachusetts College of Art and Design, and Simmons University all run their own digital repositories within a half-mile radius of each other, and the aggregate waste adds up fast. Industry benchmarks from the Digital Preservation Coalition suggest that large research libraries globally carry duplicate-image overhead averaging between 15 and 30 percent of their total visual asset storage — a range that Boston-area auditors say is consistent with what they are finding locally.

The City of Boston's own Department of Innovation and Technology, based in City Hall on Congress Street, began piloting a deduplication protocol across municipal image databases in March 2026. The pilot covers three departments — Public Works, Parks and Recreation, and the Inspectional Services Department — and is expected to produce a formal cost-benefit report by October. No figures from that report have been released publicly yet.

What Drives Duplicate Images — and What Fixes Them

The causes are less exotic than they sound. Staff turnover means the same photograph gets uploaded twice under different file names. Vendor migrations between content management systems, a routine event at places like the Isabella Stewart Gardner Museum on Evans Way or the Massachusetts Historical Society on Boylston Street, generate duplicate files when import tools fail to check existing libraries. And because image files vary in resolution, the same photograph stored at 300 dpi and again at 72 dpi will slip past basic duplicate-detection software that relies on exact file-hash matching, since the two files contain different bytes. Catching those pairs takes perceptual hashing algorithms.
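To see why exact matching catches renamed copies but misses resized ones, consider a minimal sketch of hash-based grouping. It uses Python's standard `hashlib` module; the file names and byte strings are invented stand-ins for real image files, and the function name `group_exact_duplicates` is hypothetical rather than taken from any tool mentioned in this story.

```python
import hashlib
from collections import defaultdict


def group_exact_duplicates(files):
    """Group file names whose contents are byte-for-byte identical."""
    by_digest = defaultdict(list)
    for name, data in files.items():
        digest = hashlib.sha256(data).hexdigest()
        by_digest[digest].append(name)
    # Keep only digests shared by more than one file.
    return [sorted(names) for names in by_digest.values() if len(names) > 1]


# Hypothetical in-memory "files": made-up bytes, not real image data.
files = {
    "bridge_1954.tif": b"\x10\x20\x30\x40" * 64,
    "IMG_0042.tif": b"\x10\x20\x30\x40" * 64,         # same bytes, new name
    "bridge_1954_web.tif": b"\x10\x20\x30\x40" * 16,  # stand-in for a smaller copy
}

print(group_exact_duplicates(files))
```

The renamed copy is grouped with its original because the bytes match. The smaller stand-in file is not, which is the gap that perceptual hashing is meant to close.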

Perceptual hashing, which converts images into short numerical fingerprints based on visual content rather than file data, is increasingly the standard recommended by the Library of Congress and adopted by large research universities. MIT Libraries began integrating perceptual hash tools into its DSpace digital repository in 2025, citing both storage efficiency and improved collection integrity as goals.
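One of the simplest perceptual hashes is the "average hash": shrink the image to a small grid, then record which cells are brighter than the grid's mean. The sketch below is an illustration of that general idea only, not the method used by MIT Libraries, the Library of Congress, or any product named here. It works on plain lists of grayscale values so that it runs without an imaging library, and the test images are synthetic.

```python
def average_hash(pixels, size=8):
    """Return a size*size-bit fingerprint for a 2D list of grayscale values.

    Assumes the image is at least `size` pixels tall and wide.
    """
    height, width = len(pixels), len(pixels[0])
    cells = []
    # Shrink to a size x size grid by averaging each block of pixels.
    for row in range(size):
        y0, y1 = row * height // size, (row + 1) * height // size
        for col in range(size):
            x0, x1 = col * width // size, (col + 1) * width // size
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    # One bit per cell: 1 if brighter than the overall mean, else 0.
    bits = 0
    for value in cells:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a, b):
    """Count the bit positions where two fingerprints differ."""
    return bin(a ^ b).count("1")


# Synthetic 64x64 "photograph": a diagonal brightness gradient.
original = [[(x + y) * 2 for x in range(64)] for y in range(64)]
# Lower-resolution copy: keep every fourth pixel in each direction (16x16).
resized = [row[::4] for row in original[::4]]
# Unrelated image: vertical black-and-white stripes.
stripes = [[255 if (x // 8) % 2 == 0 else 0 for x in range(64)] for y in range(64)]

h_original = average_hash(original)
h_resized = average_hash(resized)
h_stripes = average_hash(stripes)

print("original vs resized:", hamming_distance(h_original, h_resized))
print("original vs stripes:", hamming_distance(h_original, h_stripes))
```

A small distance between two fingerprints suggests the same picture at different sizes, while a large distance suggests different pictures. Where to draw the line between the two is a tuning decision that each institution would need to test against its own collection.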

For smaller Boston organizations without dedicated digital archivists, free tools like the open-source dupeGuru remain a practical starting point. Paid enterprise solutions from vendors such as Widen Collective or Bynder, both of which have clients among Boston's biotech and media sectors, offer automated deduplication at scale but carry licensing costs that can run above $30,000 annually for large repositories.

Institutions sitting on backlogs of unaudited image collections should prioritize a storage audit before the next budget cycle. For Boston city departments, the October report from the Department of Innovation and Technology will set a public benchmark — and likely prompt comparable reviews across the MBTA and the Boston Planning Department, both of which maintain large photographic archives tied to infrastructure and permitting records. The numbers already in hand make the case clearly enough: redundancy is not a minor housekeeping issue. It is a measurable, fixable budget leak.

Published by The Daily Boston

This article was produced by The Daily Boston editorial desk and covers news in Boston. See our editorial standards for how we use AI.
