The Daily Boston


Boston's Digital Archives Are Riddled With Duplicate Images — Here's What the Numbers Reveal

From the Boston Public Library to city permit databases, redundant image files are costing municipal agencies real money and real storage capacity.

By Boston News Desk · Published 4 July 2026, 2:45 pm



Boston's public institutions are sitting on a quiet data problem. Across municipal databases, university digital archives, and city-run permitting portals, duplicate images (identical or near-identical files stored multiple times under different filenames) have inflated storage costs and slowed the retrieval systems that workers rely on every day. Internal audits from comparable mid-sized American cities suggest that the redundancy rate in unmanaged image repositories can run as high as 34 percent of total stored files.

The timing matters because Boston is midway through several technology overhauls. The Wu administration's digital services office has been pushing broader modernization across city departments since 2024, including a $14.2 million contract with a Boston-based vendor to upgrade the Inspectional Services Department's permit and record system. That system, which handles everything from Jamaica Plain triple-decker renovation permits to Dorchester zoning variance photographs, ingests thousands of image files monthly. When staff upload site photos without standardized naming conventions, duplicates accumulate fast.

The Storage Math Adds Up Quickly

Storage sounds cheap until it isn't. Cloud storage for enterprise-grade municipal systems typically runs between $0.02 and $0.08 per gigabyte per month, depending on redundancy tier and access frequency. A single high-resolution construction site photograph can weigh 8 to 12 megabytes. Multiply that by the tens of thousands of permit applications filed annually with Boston's ISD — the department processed roughly 42,000 permit applications in fiscal year 2024 — and the duplicate problem translates into measurable line-item waste before any cleanup effort begins.
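To make the arithmetic concrete, here is a rough back-of-the-envelope sketch in Python. The application count, photo sizes, price range, and 34 percent duplicate rate come from the figures cited in this article; the number of photos per application is not reported anywhere above, so `PHOTOS_PER_APPLICATION` is a purely hypothetical placeholder that readers should replace with their own estimate.

```python
# Figures cited in the article.
APPLICATIONS_PER_YEAR = 42_000          # ISD permit applications, FY2024
PHOTO_SIZE_MB = (8, 12)                 # low and high size per photo
PRICE_PER_GB_MONTH = (0.02, 0.08)       # low and high cloud storage price
DUPLICATE_RATE = 0.34                   # upper-end rate from peer-city audits

# ASSUMPTION: not from the article. Photos attached per application.
PHOTOS_PER_APPLICATION = 5


def annual_duplicate_cost(applications, photos_per_app, photo_mb,
                          price_per_gb_month, duplicate_rate):
    """Yearly storage cost (dollars) of one year's worth of duplicate photos."""
    total_gb = applications * photos_per_app * photo_mb / 1000  # decimal GB
    duplicate_gb = total_gb * duplicate_rate
    return duplicate_gb * price_per_gb_month * 12


low = annual_duplicate_cost(APPLICATIONS_PER_YEAR, PHOTOS_PER_APPLICATION,
                            PHOTO_SIZE_MB[0], PRICE_PER_GB_MONTH[0],
                            DUPLICATE_RATE)
high = annual_duplicate_cost(APPLICATIONS_PER_YEAR, PHOTOS_PER_APPLICATION,
                             PHOTO_SIZE_MB[1], PRICE_PER_GB_MONTH[1],
                             DUPLICATE_RATE)

print(f"Duplicate storage cost per intake year: ${low:,.2f} to ${high:,.2f}")
```

The estimate covers raw storage for a single year of intake only. Files are typically retained for many years, so each year's duplicates add to a recurring bill, and the sketch leaves out backup copies, retrieval slowdowns, and staff time.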

The Boston Public Library's Digital Commonwealth program, which maintains digitized collections drawn from the BPL's Copley Square flagship and partner institutions across the state, confronted a version of this problem during a 2023 metadata standardization project. Librarians identified thousands of records where scanning workflows had produced multiple copies of the same image at different resolutions, all catalogued as distinct entries. The deduplication effort took more than eight months and required custom scripting built in-house by the library's technology staff.

Universities are not immune. Northeastern University's library system on Huntington Avenue and the MIT Libraries in Cambridge both maintain large-scale image repositories for research and archival purposes. Industry benchmarks from the Digital Preservation Coalition suggest that unmanaged academic image collections develop duplication rates of 20 to 28 percent over a five-year period without automated deduplication tooling in place.

What Deduplication Actually Requires

The technical fix is not complicated, but it demands consistent investment. Perceptual hashing, an algorithmic method that generates a fingerprint for each image based on visual content rather than filename, can flag near-duplicate photos even when they have been resized, recompressed, or renamed. Open-source tools such as pHash and Python libraries such as ImageHash can run this kind of sweep across a database of 100,000 images in under three hours on standard server hardware.
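For readers curious how a perceptual hash works, the following is a minimal sketch of one common variant, the difference hash, written in plain Python. It is an illustration under simplifying assumptions, not the method used by any agency named here: images are represented as grids of grayscale values rather than decoded files, the downsampling is crude nearest-neighbour sampling, and the function names and the `max_distance` threshold are hypothetical.

```python
def resize_nearest(pixels, width, height):
    """Downsample a grayscale grid (list of rows) by nearest-neighbour sampling."""
    src_h, src_w = len(pixels), len(pixels[0])
    return [
        [pixels[y * src_h // height][x * src_w // width] for x in range(width)]
        for y in range(height)
    ]


def dhash(pixels, hash_size=8):
    """Difference hash: one bit per pair of horizontally adjacent samples."""
    small = resize_nearest(pixels, hash_size + 1, hash_size)
    bits = 0
    for row in small:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming(a, b):
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(img_a, img_b, max_distance=5):
    """Flag two images whose hashes differ in at most max_distance bits."""
    return hamming(dhash(img_a), dhash(img_b)) <= max_distance


# Synthetic 72x64 test image: a V-shaped brightness pattern.
original = [[abs(x - 36) * 5 for x in range(72)] for _ in range(64)]
# The same picture at double resolution (each pixel and row repeated).
resized = [[v for v in row for _ in (0, 1)] for row in original for _ in (0, 1)]
# The same picture, uniformly brightened.
brighter = [[min(255, v + 40) for v in row] for row in original]
# A genuinely different picture: a left-to-right brightness ramp.
different = [[x * 3 for x in range(72)] for _ in range(64)]

print(is_near_duplicate(original, resized))    # True
print(is_near_duplicate(original, brighter))   # True
print(is_near_duplicate(original, different))  # False
```

Because the hash records only whether brightness rises or falls between neighbouring samples, resizing and uniform brightness changes leave it intact. Real photographs that have been recompressed usually produce small nonzero distances, which is why a threshold is used instead of an exact match.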

The harder problem is governance. Without a policy mandating deduplication at the point of upload, archives regrow redundant files within months of any cleanup. Several peer cities, including Denver and Pittsburgh, have embedded automated hash-checking into their document management systems at the intake stage, preventing duplicates from being committed to storage at all.
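An intake-stage check of this kind can be sketched in a few lines. The example below is a hypothetical illustration, not a description of the systems running in Denver, Pittsburgh, or Boston: the `IntakeRegistry` class and the filenames are invented for the example, the index lives in memory rather than in a database, and it uses a SHA-256 digest, which catches only byte-for-byte identical files.

```python
import hashlib


class IntakeRegistry:
    """In-memory stand-in for a document system's index of stored file hashes."""

    def __init__(self):
        self._seen = {}  # SHA-256 hex digest -> filename of the first upload

    def submit(self, filename, data):
        """Accept a new file, or reject it if identical bytes are already stored.

        Returns (accepted, stored_filename).
        """
        digest = hashlib.sha256(data).hexdigest()
        existing = self._seen.get(digest)
        if existing is not None:
            return False, existing      # duplicate: point to the stored copy
        self._seen[digest] = filename
        return True, filename


registry = IntakeRegistry()
photo = b"\xff\xd8\xff\xe0 placeholder bytes standing in for a JPEG"

print(registry.submit("site_front.jpg", photo))         # (True, 'site_front.jpg')
print(registry.submit("IMG_0042.jpg", photo))           # (False, 'site_front.jpg')
print(registry.submit("site_rear.jpg", photo + b"x"))   # (True, 'site_rear.jpg')
```

Rejecting the second upload by content instead of by name is what stops renamed copies from reaching storage. Catching resized or recompressed copies at intake would require swapping the SHA-256 digest for a perceptual hash and comparing against a distance threshold.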

For Boston, the practical path forward runs through the city's Department of Innovation and Technology, which oversees interoperability standards across municipal systems, and through the library networks anchored at Copley Square and the branches in Roxbury and East Boston that feed into shared statewide catalogues. Any deduplication protocol adopted for the ISD's permit portal would likely need a parallel version for cultural heritage collections, where a near-duplicate might actually be a historically distinct photograph worth preserving separately.

Residents and small contractors who upload photos through the city's online permitting interface at boston.gov can help at the margins by compressing images before submission and avoiding duplicate uploads when resubmitting applications. But the structural fix has to come from inside city hall — through policy, not user behavior.


About this article

Published by The Daily Boston

This article was produced by The Daily Boston editorial desk and covers news in Boston. See our editorial standards for how we use AI.
