The Daily Boston

Boston news, every day

How Boston's Digital Archives Ended Up Full of the Same Photo Twice: The Story Behind the Duplicate Image Problem

Years of fragmented city digitization projects, competing university databases, and a rapid pandemic-era upload push left Boston's public records systems riddled with redundant images — and now officials are working to clean it up.

By Boston News Desk · Published 4 July 2026, 2:57 pm

Photo: Juliana Çupa on Pexels

Boston's public-facing digital records repositories — spanning city hall document portals, Boston Public Library archives, and neighborhood planning databases — contain tens of thousands of duplicate image files, the product of more than a decade of overlapping and often uncoordinated digitization efforts. The problem is not new, but it has become impossible to ignore as the City of Boston's Department of Innovation and Technology pushes a 2026 infrastructure consolidation that is exposing the full scale of the redundancy for the first time.

The stakes are real. Duplicate images inflate storage costs, slow database queries, and, most critically, undermine the reliability of the planning documents that city staff and residents depend on in neighborhoods like Jamaica Plain and Dorchester, where housing production decisions hinge on accurate parcel records and historical survey photography.

A Problem Built Layer by Layer

The duplication did not happen overnight. It is the accumulated result of at least four distinct waves of digitization that were never reconciled against each other. The first major push came in the early 2000s, when the Boston Redevelopment Authority — now the Boston Planning and Development Agency, headquartered at One City Hall Square — began scanning legacy permit files. A second wave followed after the 2013 FEMA flood-mapping update required coastal imagery to be re-uploaded to a separate federal-compatible system. Neither dataset was deduplicated against the other.

The third and largest contributor was the March 2020 to December 2021 emergency digitization sprint, when city departments shifted to remote operations and staff began uploading documents from home using personal scanners, cloud drives, and at least three different file-naming conventions. The Boston Public Library's Copley Square branch, which houses the city's primary microfilm conversion lab, processed backlogs from multiple agencies simultaneously during that period, frequently generating duplicate TIF and JPEG exports from the same source negatives.

The fourth layer came from university partnerships. Northeastern University's library system and the Harvard Map Collection both contributed digitized historical imagery to the Boston City Archives under memoranda of understanding that contained no deduplication requirements. By some internal estimates — figures that city officials have not yet released publicly — the Boston Planning and Development Agency's spatial data portal alone may contain upward of 40,000 redundant image entries, though that number is unverified pending the ongoing audit.

What the Cleanup Actually Involves

The current remediation effort is being driven by the Mayor's Office of New Urban Mechanics, which was tasked in January 2026 with coordinating a cross-departmental data quality initiative under the Wu administration's broader open-government agenda. The project involves deploying perceptual hashing algorithms — software tools that identify visually identical or near-identical images regardless of filename — across the city's primary content management system, which runs on an instance of the open-source platform Nuxeo.
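For readers curious how perceptual hashing works, the following is a minimal sketch of one of the simplest variants, an "average hash". It is illustrative only: the city has not said which algorithm or library it is using, and every function name here (`shrink`, `average_hash`, `hamming_distance`, `is_near_duplicate`) and the distance threshold of 5 are assumptions made for this example. To stay self-contained it works on plain grids of grayscale values; a real pipeline would first decode TIF or JPEG files with an imaging library.

```python
from typing import List

Gray = List[List[int]]  # rows of grayscale pixel values, 0-255


def shrink(pixels: Gray, out_w: int = 8, out_h: int = 8) -> List[List[float]]:
    """Downscale by averaging the source pixels that fall in each output cell."""
    in_h, in_w = len(pixels), len(pixels[0])
    small = []
    for oy in range(out_h):
        y0 = oy * in_h // out_h
        y1 = max((oy + 1) * in_h // out_h, y0 + 1)  # always cover at least one row
        row = []
        for ox in range(out_w):
            x0 = ox * in_w // out_w
            x1 = max((ox + 1) * in_w // out_w, x0 + 1)  # at least one column
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(sum(block) / len(block))
        small.append(row)
    return small


def average_hash(pixels: Gray) -> int:
    """Return a 64-bit fingerprint: one bit per cell, set if the cell is brighter than the mean."""
    cells = [v for row in shrink(pixels) for v in row]
    mean = sum(cells) / len(cells)
    bits = 0
    for v in cells:
        bits = (bits << 1) | (1 if v > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: Gray, b: Gray, max_distance: int = 5) -> bool:
    """Treat two images as duplicates if their fingerprints differ in few bits.

    The threshold of 5 is an illustrative assumption, not a value from the city.
    """
    return hamming_distance(average_hash(a), average_hash(b)) <= max_distance
```

Because the fingerprint depends on the overall pattern of light and dark rather than on exact bytes or filenames, a slightly brightened or re-exported copy of a scan tends to hash to the same or nearly the same value, while an unrelated image differs in many bits. Average hashing can be fooled by cropping or rotation, which is one reason matches are usually checked against originals.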

The practical geography of the problem is concentrated in two places. The Boston City Archives, located at 201 Rivermoor Street in West Roxbury, holds the physical originals against which digital files must be verified. The Norman B. Leventhal Map and Education Center at the Boston Public Library downtown is the secondary verification node for historical cartographic imagery. Staff at both locations have been cross-referencing digital exports against physical holdings since February 2026.

Storage costs for municipal cloud infrastructure have risen sharply. Amazon Web Services' S3 service, which the city uses for backup storage, stood at approximately $0.023 per gigabyte per month in 2025, meaning that every redundant gigabyte of image data carries a real, ongoing cost. For a repository with millions of files, the cumulative figure becomes significant across a full fiscal year.
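The arithmetic behind that per-gigabyte figure can be sketched in a few lines. This is a rough illustration, not the city's accounting: the function name `annual_storage_cost` is hypothetical, the average file size is a placeholder the reader must supply, and the sketch treats 1 GB as 1,000 MB and ignores request fees, storage tiers, and volume discounts.

```python
def annual_storage_cost(redundant_files: int,
                        avg_file_mb: float,
                        usd_per_gb_month: float = 0.023) -> float:
    """Estimate the yearly cost, in US dollars, of storing redundant files.

    usd_per_gb_month defaults to the approximate rate cited in the article.
    avg_file_mb is an assumption supplied by the caller, not a city figure.
    """
    redundant_gb = redundant_files * avg_file_mb / 1000  # simplification: 1 GB = 1,000 MB
    return redundant_gb * usd_per_gb_month * 12  # twelve monthly charges
```

For example, `annual_storage_cost(40_000, 50)` applies the unverified 40,000-entry estimate for one portal together with an assumed 50 MB average file size. The result scales linearly with both inputs, so the total depends heavily on how large the duplicated files actually are and how many repositories are counted.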

The Department of Innovation and Technology has set a September 30, 2026, target for completing the first phase of deduplication, which covers planning and zoning imagery. Residents searching property records through the city's Assessing Online portal or the BPDA's IMAPs tool may notice gaps in image availability during the consolidation window. City staff have advised that requests for specific parcel photographs can still be fulfilled manually through the City Archives during that period, a slower process but a reliable one while the automated cleanup runs its course.

About this article

Published by The Daily Boston

This article was produced by The Daily Boston editorial desk and covers news in Boston. See our editorial standards for how we use AI.
