The Daily Boston

Boston Is Quietly Leading the Push to Purge Duplicate Images From Public Records — Other Cities Are Watching

As municipalities globally scramble to clean up digitized archives bloated with redundant imagery, Boston's approach is drawing attention from Amsterdam to Seoul.

By Boston News Desk · Published 4 July 2026, 3:00 pm

Photo: Alexa Heinrich on Pexels

Boston's city archives division has been working since late 2024 to systematically identify and remove duplicate images embedded in digitized public records, an unglamorous but increasingly urgent task as the volume of scanned documents in municipal databases has ballooned past manageable limits. The effort, centered at City Hall on Cambridge Street and coordinated with the Boston City Archives office in West Roxbury, puts Boston ahead of most American peer cities in tackling what archivists call the "duplicate image problem" in government document management.

The issue matters now because cities everywhere are deep into multi-year digitization drives, converting paper records into searchable databases. That process, done at scale and often by multiple contractors working in parallel, routinely produces redundant image files — the same photograph or scanned page appearing dozens of times across different folders and record systems. Storage costs compound quickly. A single duplicated high-resolution image might consume several megabytes; multiply that by tens of thousands of records and municipal IT budgets take a measurable hit. Boston's Office of Innovation and Technology has flagged the problem as a direct drag on its broader open-data initiative, which aims to make city records accessible through a unified public portal by the end of fiscal year 2027.

What Boston Is Actually Doing

The city's approach pairs automated deduplication software — specifically a hashing-based detection system applied to the Legistar document management platform used by the Boston City Council — with manual review by staff at the City Archives. The Archives, physically housed on Canterbury Street in West Roxbury, holds records stretching back to the 17th century. Digitized versions of those records, accumulated through a project that accelerated during the COVID-era office closures of 2020 and 2021, are where the bulk of duplication was found. By January 2026, city technology staff had identified more than 140,000 redundant image files across the archives' digital holdings, according to internal planning documents reviewed by The Daily Boston. Clearing those files freed an estimated 2.3 terabytes of server storage.
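For readers curious what hashing-based detection looks like in practice, here is a minimal Python sketch. It is not the city's software, whose internals have not been published; the function names and the choice of SHA-256 are assumptions made for illustration. The idea is that two files with byte-for-byte identical content produce the same digest, so grouping files by digest surfaces exact copies regardless of folder or filename.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(root: Path) -> dict[str, list[Path]]:
    """Group files under root by content hash; keep groups of 2 or more."""
    groups: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            groups[sha256_of_file(path)].append(path)
    # Only hashes shared by several files point to redundant copies.
    return {h: paths for h, paths in groups.items() if len(paths) > 1}
```

A sketch like this only catches exact copies. A rescanned or recompressed version of the same page has different bytes and therefore a different digest, which is one reason a manual review step remains useful.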

The Jamaica Plain-based nonprofit Boston Digital Equity Coalition, which has worked alongside the city on open-records access projects, has been consulted informally on how the deduplication process affects public searchability. The concern from community advocates is straightforward: removing a file flagged as a duplicate when it is actually a distinct image — even marginally different — destroys a public record. The city's protocols require human sign-off before any deletion, a safeguard that slows the process but protects against irreversible errors.
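The sign-off requirement can be pictured as a simple gate between flagging and deleting. The Python sketch below is purely illustrative: the class, field, and function names are hypothetical and do not describe the city's actual workflow. It shows one way to ensure that a flagged file is never eligible for deletion until a named reviewer has approved it.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Review(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class DuplicateCandidate:
    path: str          # file flagged as redundant (hypothetical field)
    duplicate_of: str  # file it is believed to duplicate
    review: Review = Review.PENDING
    reviewer: Optional[str] = None


def sign_off(candidate: DuplicateCandidate, reviewer: str, approve: bool) -> None:
    """Record a named reviewer's decision on one flagged file."""
    candidate.review = Review.APPROVED if approve else Review.REJECTED
    candidate.reviewer = reviewer


def cleared_for_deletion(candidates: list[DuplicateCandidate]) -> list[str]:
    """Return only the paths a named reviewer has explicitly approved."""
    return [
        c.path
        for c in candidates
        if c.review is Review.APPROVED and c.reviewer
    ]
```

Because every candidate starts as pending, a file that no one has looked at stays in place by default.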

How Other Cities Compare

Amsterdam began a similar deduplication program for its Stadsarchief — the city's main municipal archive — in 2023, using open-source perceptual hashing tools developed in partnership with the University of Amsterdam. The Dutch program is further along in automation but has faced criticism from heritage groups over transparency in deletion decisions. Seoul's municipal government digitized roughly 12 million pages of administrative records between 2020 and 2025 and has acknowledged ongoing duplication issues, though a formal remediation program has not yet been publicly announced. London's Metropolitan Archives completed a deduplication audit of its post-2000 digital holdings in March 2025, finding redundancy rates of around 18 percent across scanned planning documents — a figure that Boston's own internal review roughly mirrors.
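Perceptual hashing differs from exact hashing in that it fingerprints what an image looks like rather than its raw bytes. The Python sketch below shows one simple variant, often called an average hash. It is not the Amsterdam tooling, and it assumes each image has already been shrunk to a small grayscale grid of brightness values (a step that would normally need an imaging library). Each pixel becomes a 1 if it is brighter than the grid's mean and a 0 otherwise, and two fingerprints are compared by counting differing bits.

```python
def average_hash(pixels: list[list[int]]) -> int:
    """Build a bit fingerprint: 1 where a pixel is brighter than the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions where two fingerprints differ."""
    return bin(a ^ b).count("1")


# Toy 2x2 grids standing in for downscaled scans (illustrative values).
original = [[10, 200], [220, 30]]
recompressed = [[12, 198], [221, 33]]  # same page, slightly altered pixels
different = [[200, 10], [30, 220]]     # a visually distinct image

print(hamming_distance(average_hash(original), average_hash(recompressed)))  # 0
print(hamming_distance(average_hash(original), average_hash(different)))     # 4
```

A small distance suggests a near-duplicate, but the cutoff is a judgment call: set it too loosely and genuinely distinct records get flagged, which is the risk heritage groups and community advocates have raised.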

Philadelphia, which faces a digitization backlog comparable to Boston's, has not launched a formal deduplication program as of this writing. Washington, D.C.'s Office of Public Records acknowledged the problem in a 2025 budget submission but requested funding only for expanded storage rather than deduplication. That contrast helps explain why Boston's methodology is being studied: instead of buying more server capacity to absorb the redundancy, the city is trying to eliminate it.

For residents who use Boston's public records portal — accessible at records.boston.gov — the practical upshot should eventually be faster search results and fewer duplicate hits when requesting historical property documents, zoning filings, or council meeting materials. The city expects the first phase of the cleanup to wrap up by October 2026, with a second phase covering pre-2000 scanned records beginning in early 2027. Anyone with concerns about specific records being incorrectly flagged for deletion can contact the City Archives directly at its Canterbury Street office, where staff review requests on a rolling basis.

About this article

Published by The Daily Boston

This article was produced by The Daily Boston editorial desk and covers news in Boston. See our editorial standards for how we use AI.
