The Daily Boston

Boston's Digital Archives Are Riddled With Duplicate Images. Here's What Officials and Experts Are Saying About Fixing It.

From City Hall to the Boston Public Library, administrators and technologists are grappling with a costly, sprawling problem hiding in plain sight inside municipal and institutional databases.

By Boston News Desk · Published 4 July 2026, 2:51 pm

Boston's public institutions are sitting on tens of thousands of redundant digital images — duplicated photographs, scanned documents, and archived visuals spread across servers at city agencies, libraries, and universities — and the effort to clean up those records is now drawing serious attention from technologists, archivists, and city administrators. The problem has become acute enough that the Mayor's Office of New Urban Mechanics, which oversees civic technology initiatives, has begun conversations with outside vendors about automated deduplication tools.

The issue matters now for a specific reason: Boston is midway through a multi-year digitization push launched under Mayor Michelle Wu's administration that aims to move city records, historical photographs, and planning documents into publicly accessible online repositories. When duplicate files flood those repositories, storage costs climb, search results degrade, and, critically, the public ends up with an unreliable picture of what the city actually holds.

What the Institutions Are Dealing With

The Boston Public Library's Digital Commonwealth platform, which hosts digitized collections from hundreds of Massachusetts cultural institutions, is one of the largest affected systems in the region. Archivists working with the BPL have described the challenge in public presentations at the Simmons University School of Library and Information Science on The Fenway: when multiple partner institutions contribute scans of the same historical photograph or document, the platform can end up with three or four near-identical image files indexed as separate records. Simmons faculty who specialize in digital preservation have argued in professional forums that the solution is not simply deleting files but building smarter ingest workflows that flag likely duplicates before they enter the system.

At Northeastern University's library on Huntington Avenue, staff managing the university's special collections have piloted a perceptual hashing approach — a technique that generates a compact numerical fingerprint for each image and compares it against existing records — to catch duplicates at the point of upload. The approach is gaining traction in academic library circles because it operates without requiring staff to manually review thousands of files.
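To make the fingerprinting idea concrete, here is a minimal sketch of one common perceptual hash, the "average hash." It is an illustration only, not the method Northeastern's staff piloted, which the article does not specify. The sketch assumes each image has already been reduced to a small grayscale grid of the same size (8 by 8 here); the function names and the distance threshold of 5 are hypothetical choices.

```python
from typing import List

Grid = List[List[int]]  # grayscale pixel values (0-255), row by row


def average_hash(pixels: Grid) -> int:
    """Build a fingerprint with one bit per pixel: 1 if brighter than the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    fingerprint = 0
    for value in flat:
        fingerprint = (fingerprint << 1) | (1 if value > mean else 0)
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions where two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_likely_duplicate(a: Grid, b: Grid, max_distance: int = 5) -> bool:
    """Flag two images whose fingerprints differ in only a few bits.

    The threshold is an assumed value; a real workflow would tune it.
    """
    return hamming_distance(average_hash(a), average_hash(b)) <= max_distance


# Hypothetical 8x8 test images: a gradient, a slightly brighter rescan of it,
# and an inverted image that should not match.
original = [[(r * 8 + c) * 4 for c in range(8)] for r in range(8)]
rescanned = [[min(255, v + 3) for v in row] for row in original]
different = [[255 - v for v in row] for row in original]

print(is_likely_duplicate(original, rescanned))  # True
print(is_likely_duplicate(original, different))  # False
```

Because the fingerprint depends on each pixel's brightness relative to the image's own mean, a uniformly brighter rescan produces the same bits, while a genuinely different image does not. A byte-for-byte comparison would treat the two scans as unrelated files.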

City agencies face a parallel but distinct version of the problem. The Boston Planning Department, which absorbed the former Boston Planning and Development Agency in a 2024 reorganization, maintains large internal photo archives documenting construction inspections, permit reviews, and neighborhood surveys across Jamaica Plain, Dorchester, and Roxbury. When staff rotate or projects change hands, the same site photograph frequently gets uploaded multiple times under different file names. That redundancy inflates storage costs — enterprise cloud storage for city government can run from roughly $0.02 to $0.05 per gigabyte per month, and archives can run into hundreds of terabytes — and complicates public records requests.
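The arithmetic behind those storage figures is simple enough to sketch. The example below uses the per-gigabyte price range quoted above; the 200-terabyte archive size and the 10 percent duplicate share are hypothetical inputs chosen for illustration, not figures reported by any city agency, and the sketch assumes decimal terabytes (1 TB = 1,000 GB).

```python
GB_PER_TB = 1_000  # assumption: decimal terabytes


def monthly_storage_cost(terabytes: float, price_per_gb: float) -> float:
    """Monthly bill in dollars for an archive of the given size."""
    return terabytes * GB_PER_TB * price_per_gb


def monthly_duplicate_cost(
    terabytes: float, price_per_gb: float, duplicate_share: float
) -> float:
    """Portion of the monthly bill spent storing redundant copies."""
    return monthly_storage_cost(terabytes, price_per_gb) * duplicate_share


ARCHIVE_TB = 200        # hypothetical archive size
DUPLICATE_SHARE = 0.10  # hypothetical fraction of storage that is redundant

for price in (0.02, 0.05):  # price range quoted in the article, $/GB/month
    total = monthly_storage_cost(ARCHIVE_TB, price)
    wasted = monthly_duplicate_cost(ARCHIVE_TB, price, DUPLICATE_SHARE)
    print(f"${price:.2f}/GB: ${total:,.0f} per month, ${wasted:,.0f} on duplicates")
```

Under those assumptions, the archive would cost between $4,000 and $10,000 a month to store, with $400 to $1,000 of that going to redundant copies.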

What Comes Next, and What Experts Recommend

Technologists advising Boston-area institutions broadly agree on a few practical steps. First, any institution running a digitization program needs a deduplication audit before scaling up. Second, automated tools should be embedded into upload pipelines rather than applied retrospectively, because retrospective cleanup costs considerably more in staff time. Third, metadata standards need to be enforced consistently so that even when two slightly different scans of the same image exist for legitimate reasons, they can be linked rather than siloed.
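The first of those steps, an audit, can start with something quite basic: grouping files by a cryptographic digest of their contents so that identical files surface regardless of what they are named. The sketch below is one possible approach under simplifying assumptions, not a description of any institution's tooling. The function name and the sample file names are hypothetical, and file contents are held in memory for brevity.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def find_exact_duplicates(files: Dict[str, bytes]) -> List[List[str]]:
    """Group file names whose contents are byte-for-byte identical."""
    by_digest = defaultdict(list)
    for name, content in files.items():
        digest = hashlib.sha256(content).hexdigest()
        by_digest[digest].append(name)
    # Keep only digests shared by more than one file.
    return [sorted(names) for names in by_digest.values() if len(names) > 1]


# Hypothetical uploads: the same site photo saved under two names,
# plus one unrelated scan.
uploads = {
    "site_0412.jpg": b"same-photo-bytes",
    "IMG_2231.jpg": b"same-photo-bytes",
    "permit_scan.png": b"other-bytes",
}

print(find_exact_duplicates(uploads))  # [['IMG_2231.jpg', 'site_0412.jpg']]
```

An exact-match audit like this catches the renamed-file case but not two different scans of the same photograph, which is where perceptual comparison and consistent metadata come in.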

The Wu administration has not announced a formal citywide policy on duplicate image management as of July 4, 2026, but the Office of New Urban Mechanics has flagged digital asset governance as part of its broader open-data strategy, which was last updated in the spring. Advocates in the civic-tech community who follow Boston's data initiatives say the window for setting standards is now, while the digitization program is still expanding rather than already complete.

For residents and researchers who use public digital archives — whether pulling historical Dorchester neighborhood photographs from Digital Commonwealth or requesting planning images through the city's public records portal — the practical advice is straightforward: if a search returns suspiciously similar results, report the duplication through the platform's feedback mechanism. Institutions say those user reports are among the most reliable early signals they have that a deduplication problem is growing.

About this article

Published by The Daily Boston

This article was produced by The Daily Boston editorial desk and covers news in Boston. See our editorial standards for how we use AI.
