The Daily Boston

Boston's Digital Archives Have a Duplicate Image Problem — and the Numbers Are Staggering

From the Boston Public Library to city government servers, redundant image files are quietly eating storage budgets and slowing down public-records access across the region.

By Boston News Desk · Published 4 July 2026, 2:45 pm

Photo: Phil Evenden on Pexels

Boston's public institutions are sitting on a sprawling, largely unmeasured problem: duplicate digital images lodged inside municipal databases, library archives, and university repositories that are costing taxpayers real money and degrading search performance across dozens of systems. A push is building inside City Hall and at several Roxbury and Downtown Crossing-area agencies to quantify the waste before the next budget cycle.

The timing matters because Mayor Michelle Wu's administration has made open data and digital service modernization a recurring budget priority since 2022, committing the city to expanding its Analyze Boston data portal and pushing departments toward leaner, more accessible record-keeping. When redundant image files clog those systems, the downstream cost shows up in slower public records responses, inflated cloud storage contracts, and IT staff hours spent on manual cleanup — none of which shows up neatly on a line item.

What the Data Actually Shows

Industry benchmarks from enterprise content management studies suggest that between 20 and 40 percent of files in a large institutional digital repository are exact or near-exact duplicates. For a mid-size city government, that share translates to tens of thousands of redundant image files. The Boston Public Library, headquartered in Copley Square, holds digitized photographs, maps, and archival documents in its Digital Repository, which has grown substantially over the past decade as grant-funded digitization projects added material faster than deduplication protocols were written.

The city's IT department tracks overall data storage consumption across municipal servers, but duplicate image detection has not historically been a line item in the annual technology budget presented to the Boston City Council. That gap is significant. Cloud storage pricing for municipal governments typically runs between $0.02 and $0.05 per gigabyte per month through standard government procurement contracts, meaning even a 10-terabyte pool of duplicate image data could cost roughly $2,400 to $6,000 per year. That is modest in isolation, but multiplied across the BPL, the Inspectional Services Department, the Boston Planning and Development Agency, and the Archives and Records Management division, the figures compound quickly.
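
The storage arithmetic can be checked with a short calculation. The sketch below is illustrative only: the function name `annual_storage_cost` is hypothetical, and it assumes decimal units (1 terabyte = 1,000 gigabytes) and flat per-gigabyte pricing with no tiering or retrieval fees, none of which the cited figures specify.

```python
def annual_storage_cost(terabytes, usd_per_gb_month, gb_per_tb=1000):
    """Return the yearly cost in USD of storing `terabytes` of data.

    Assumes flat per-gigabyte monthly pricing and decimal units
    (1 TB = 1,000 GB); real contracts may tier or round differently.
    """
    gigabytes = terabytes * gb_per_tb
    return round(gigabytes * usd_per_gb_month * 12, 2)


# The article's example: 10 TB of duplicates at $0.02 and $0.05 per GB per month.
low = annual_storage_cost(10, 0.02)
high = annual_storage_cost(10, 0.05)
print(f"10 TB of duplicates: ${low:,.0f} to ${high:,.0f} per year")
```

At the quoted rates, the low end of the range is about $2,400 a year and the high end about $6,000, for a single 10-terabyte pool.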

Northeastern University's library system on Huntington Avenue and the UMass Boston archive at Columbia Point have separately grappled with the same issue on the academic side. Both institutions run digital preservation programs that ingest images from multiple sources, creating the conditions for duplication when digitization batches overlap or when donated collections arrive with pre-existing copies of materials already held on local servers.

Who Is Trying to Fix It

The practical approaches being explored involve automated hashing tools — software that generates a unique fingerprint for each image file and flags matches — and metadata audits that cross-reference file creation dates, sizes, and pixel dimensions. These tools have matured significantly since 2020 and can now process tens of thousands of image files per hour on standard server hardware. The Massachusetts Executive Office of Technology Services and Security, which sets statewide IT standards, issued updated data governance guidelines in early 2025 that encourage agencies to adopt deduplication workflows as part of routine storage hygiene, though compliance at the municipal level remains voluntary.
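
As a rough illustration of the hashing approach described above, the sketch below groups byte-identical image files by SHA-256 digest, using file size as a cheap metadata pre-filter so that only files sharing a size are hashed. It is a minimal example under stated assumptions, not any agency's actual tooling: the function names and the list of image extensions are hypothetical, and a cryptographic hash only catches exact copies. Near-duplicates such as resized or re-encoded images would need a perceptual-hashing method, which is not shown here.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

# Hypothetical list of extensions to treat as images; adjust per archive.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".gif"}


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_exact_duplicates(root):
    """Return lists of image paths under `root` whose bytes are identical."""
    # Step 1: bucket by file size; files of different sizes cannot match.
    by_size = defaultdict(list)
    for p in Path(root).rglob("*"):
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES:
            by_size[p.stat().st_size].append(p)

    # Step 2: hash only the files that share a size with another file.
    by_digest = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        for p in paths:
            by_digest[file_digest(p)].append(p)

    # Step 3: keep only digests seen more than once.
    return [sorted(group) for group in by_digest.values() if len(group) > 1]
```

Calling `find_exact_duplicates` on a directory returns one list per set of identical files, which an auditor could review before deleting anything. Skipping the hash for files with a unique size keeps the scan cheap on large collections.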

For residents who use city digital services — pulling permits through the Inspectional Services portal on City Hall Plaza, searching historic photographs through the BPL's online catalog, or accessing BPDA planning documents — the practical effect of unresolved duplication is slower search results and occasionally conflicting file versions surfacing for the same record. That is the piece of the problem that tends to move administrators, even when raw storage costs do not.

City departments now have until the start of fiscal year 2027, which begins July 1 of next year, to submit updated digital asset inventories as part of the Wu administration's broader IT modernization review. Whether duplicate image cleanup earns its own budget line by then will depend on how clearly department heads can put a dollar figure on the inefficiency — which is, somewhat circularly, itself a data problem worth solving first.

About this article

Published by The Daily Boston

This article was produced by The Daily Boston editorial desk and covers news in Boston. See our editorial standards for how we use AI.
