The Daily Boston

Boston's Digital Archives Are Full of Duplicate Images. Officials and Experts Say That's a Bigger Problem Than It Sounds.

From City Hall to university libraries in Fenway, administrators and archivists are pushing back against a quiet data crisis hiding inside Boston's public records systems.

By Boston News Desk · Published 4 July 2026, 3:45 pm

Photo: Dominik Gryzbon on Pexels

Boston's municipal and institutional databases are carrying thousands of redundant image files—duplicate photographs, scanned documents, and digital records stored multiple times across overlapping systems—and the people responsible for managing those archives say the cleanup is long overdue.

The issue has surfaced repeatedly in conversations around Mayor Michelle Wu's broader push to modernize city operations, particularly as her administration has invested in digitizing public records from departments including the Inspectional Services Department and the Office of Housing Stability. When files are ingested from legacy systems, duplicates often travel with them, quietly inflating storage costs and making records searches slower and less reliable.

Why This Matters Right Now

The timing is not coincidental. The city is in the middle of a multi-year effort to consolidate housing permit records and inspection photos from neighborhoods like Jamaica Plain and Dorchester—two areas that have seen heavy construction activity under the Wu administration's affordable housing production goals. When the same property photograph exists in three separate folders under slightly different filenames, staff pulling records for a zoning hearing or a legal dispute can pull the wrong version, or miss context altogether.

Librarians and data professionals at institutions including Northeastern University's Snell Library and the Boston Public Library's Digital Repository Service have dealt with the problem for years on the academic side. The challenge is well-documented in archival science: without a systematic deduplication protocol built into the ingest process, collections grow bloated. One widely cited estimate in library science literature suggests that unmanaged digital collections can carry duplicate rates of 15 to 30 percent of total file volume, depending on how many legacy systems were merged during migration.

For a city like Boston, where the IT budget has faced sustained pressure, that kind of redundancy translates into real dollars, because cloud storage costs add up at scale. The MBTA's own data management overhaul, ongoing since 2023, has included a deduplication pass across its internal image libraries as part of a broader infrastructure modernization effort.

What Experts Are Recommending

The consensus among records managers and civic technologists is that the fix requires two things: a retroactive audit of existing archives, and a mandatory deduplication step built into any future upload or ingest workflow. Neither is simple. Retroactive audits require staff time and, in some cases, human judgment—an algorithm can flag two images as identical based on file hash, but a person has to confirm that neither version contains unique metadata worth preserving.
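To make the file-hash step concrete, here is a minimal Python sketch of how an audit might flag byte-for-byte duplicates for human review. It is an illustration only, not a tool any agency named in this story is confirmed to use. The function names and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(root):
    """Group files under `root` by content hash.

    Hypothetical helper: returns a list of groups, each a list of paths
    whose bytes are identical. Nothing is deleted; the groups are meant
    to be handed to a person for review.
    """
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            groups[sha256_of_file(path)].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]
```

A matching hash means the file contents are identical, but filenames, folder locations, and catalog entries can still differ between copies. That is the part of the decision the sketch leaves to a reviewer.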

Organizations like the Metropolitan Area Planning Council, which works with municipalities across Greater Boston, have begun advising member towns and cities to adopt open-source deduplication tools as part of their records management policies. Several of those conversations have been happening through the Council's Digital Services working group, which has met quarterly at its offices on Congress Street in downtown Boston.

On the academic side, digital humanities programs at Boston College and MIT have both piloted workflows using perceptual hashing—a technique that catches near-duplicate images even when files have been slightly resized or recompressed—as part of grant-funded digitization projects. The results, presented at archival conferences in 2024 and 2025, suggested significant reductions in redundant storage within six months of implementation.
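For readers curious how perceptual hashing differs from an exact file hash, the Python sketch below shows one simple variant, often called an average hash. It is not the workflow used in the university pilots, whose details were not described. The function names and the distance threshold are assumptions for illustration, and the input is assumed to be a grid of grayscale values at least 8 pixels on each side.

```python
def average_hash(pixels, hash_size=8):
    """Compute a simple average hash from a 2D grid of grayscale values.

    `pixels` is a list of equal-length rows of numbers (0-255). The grid is
    shrunk to hash_size x hash_size by averaging blocks, then each cell
    becomes one bit: 1 if the cell is brighter than the mean of all cells.
    Assumes the image is at least hash_size pixels in each dimension.
    """
    height, width = len(pixels), len(pixels[0])
    cells = []
    for row in range(hash_size):
        top = row * height // hash_size
        bottom = (row + 1) * height // hash_size
        for col in range(hash_size):
            left = col * width // hash_size
            right = (col + 1) * width // hash_size
            block = [pixels[y][x]
                     for y in range(top, bottom)
                     for x in range(left, right)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(hash_a, hash_b):
    """Count the bit positions where two hashes differ."""
    return bin(hash_a ^ hash_b).count("1")


def is_near_duplicate(pixels_a, pixels_b, max_distance=5):
    """Flag two images as likely near-duplicates.

    `max_distance` is an illustrative threshold, not a recommended value.
    """
    distance = hamming_distance(average_hash(pixels_a), average_hash(pixels_b))
    return distance <= max_distance
```

Because the hash is built from coarse brightness patterns instead of exact bytes, a resized or recompressed copy tends to land within a few bits of the original, while an unrelated image usually lands far away. Production tools typically use more robust variants, and flagged pairs would still need human review.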

For residents, the practical stakes show up in moments like requesting a permit history for a triple-decker on Bowdoin Street in Dorchester, or checking inspection records for a new development near the Green Street MBTA stop in Jamaica Plain. Duplicate and disorganized records slow those searches, and in disputes over housing code compliance or property history, delays have consequences.

City officials have not announced a formal deduplication initiative by name, but the Wu administration's IT modernization roadmap—released in draft form in late 2025—includes language around data integrity and storage efficiency. Advocates who follow municipal technology policy say the next budget cycle, beginning in the fall of 2026, is the likely moment for a concrete program to emerge. Anyone with records requests pending through the city's Constituent Services portal should expect the current system, redundancies and all, to remain in place through at least the end of the year.

About this article

Published by The Daily Boston

This article was produced by The Daily Boston editorial desk and covers news in Boston. See our editorial standards for how we use AI.
