The Daily Boston

Boston's Digital Archives Are Riddled With Duplicate Images — and the Numbers Tell a Costly Story

From City Hall records to university libraries, redundant image files are quietly eating up storage budgets and distorting public data collections across the region.

By Boston News Desk · Published 4 July 2026, 2:40 pm

Boston's public and academic institutions are sitting on a sprawling, largely unaudited problem: tens of thousands of duplicate digital images spread across government servers, university repositories, and nonprofit archives, costing real money and undermining the integrity of public records. A review of data management practices across several city-linked organizations puts the scale of the issue in sharper focus ahead of a broader push toward digital transparency under Mayor Michelle Wu's open-data initiative.

The timing matters. Wu's administration has committed to expanding the city's open data portal — hosted at data.boston.gov — and data quality is front and center in that effort. Duplicate image files are not simply a storage nuisance. When the same photograph or scanned document appears under multiple file names or in multiple directories, it inflates dataset counts, skews search results, and can generate false matches in automated processing pipelines used by planners, researchers, and journalists alike.

The Scale of the Problem in Boston's Institutions

At Northeastern University's library system on Huntington Avenue, archivists have been working since early 2025 to audit digitized collections that span more than a century of New England history. Library staff identified that roughly 18 percent of images ingested into one collection between 2019 and 2023 were near-duplicates — identical or near-identical scans uploaded under different metadata tags, often because multiple staff members digitized the same physical item independently. That figure, shared in an internal review document that became part of a Digital Preservation Coalition working paper, is broadly consistent with rates seen in comparable academic digitization projects in the United States and the United Kingdom.

The Boston Public Library's Norman B. Leventhal Map & Education Center, located on Boylston Street in Copley Square, faced a similar reckoning when it expanded its online geospatial image collection in 2024. Automated deduplication tools flagged more than 2,400 image files as likely duplicates out of a collection that then numbered around 14,000 items, a duplication rate of roughly 17 percent. Storage for large TIFF files, the standard archival format, costs roughly $0.023 per gigabyte per month on commercial cloud platforms as of mid-2026, according to published pricing from Amazon Web Services. For institutions holding hundreds of thousands of high-resolution scans, redundant files can add up to thousands of dollars annually in unnecessary cloud spend.
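The arithmetic behind those figures can be sketched in a few lines of Python. The file counts and the per-gigabyte price come from the figures above; the average file size of 0.1 GB is a hypothetical placeholder, not a number reported by the Leventhal Center, and the function names are illustrative.

```python
def duplication_rate(duplicates: int, total: int) -> float:
    """Share of a collection flagged as duplicate (0.0 to 1.0)."""
    return duplicates / total


def annual_redundant_cost(
    duplicates: int,
    avg_file_gb: float,
    price_per_gb_month: float = 0.023,  # per-GB monthly price cited in the article
) -> float:
    """Yearly storage cost, in dollars, of keeping the duplicate files."""
    return duplicates * avg_file_gb * price_per_gb_month * 12


# Figures reported for the Leventhal Center collection.
rate = duplication_rate(2400, 14000)

# ASSUMPTION: 0.1 GB per TIFF is a made-up average for illustration only.
cost = annual_redundant_cost(2400, avg_file_gb=0.1)

print(f"Duplication rate: {rate:.1%}")
print(f"Annual cost of duplicates: ${cost:,.2f}")
```

At that assumed file size, 2,400 duplicates cost well under $100 a year. The bill only climbs into the thousands for the far larger collections, or far larger files, described above.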

The MBTA's public communications archive — used to store press photos, infrastructure images, and construction documentation — is another case in point. The Authority has been under scrutiny since the Federal Transit Administration's oversight period that began in 2022, and internal document management has been part of broader operational reform efforts. Duplicate image files in procurement and construction records can complicate audit trails, a concern that matters when federal reviewers are examining project documentation.

Detection Tools and What Comes Next

The technical solutions are well established. Perceptual hashing algorithms, which generate a compact fingerprint for each image based on visual content rather than file name, can scan a collection of 100,000 images in under two hours on standard server hardware and flag duplicates with accuracy rates above 95 percent, according to published benchmarks from the open-source tool pHash. The harder work is human: deciding which copy to keep, reconciling metadata, and updating any links or citations pointing to the retired file.
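For readers curious how a visual fingerprint works, here is a minimal sketch of one of the simplest perceptual hashes, the average hash. It is not the algorithm pHash itself uses, and it assumes each image has already been shrunk to a tiny grayscale grid; the 4x4 pixel grids below are invented for illustration.

```python
from typing import List


def average_hash(pixels: List[List[int]]) -> int:
    """Fingerprint a small grayscale grid: one bit per pixel,
    set when the pixel is brighter than the grid's mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions where two fingerprints differ."""
    return bin(a ^ b).count("1")


# Hypothetical 4x4 grayscale thumbnails (values 0-255).
original = [
    [10, 20, 200, 210],
    [15, 25, 205, 215],
    [12, 22, 202, 212],
    [11, 21, 201, 211],
]
# The same scene scanned slightly brighter: a near-duplicate.
rescan = [[value + 3 for value in row] for row in original]
# A different image: bright on the left, dark on the right.
different = [
    [200, 210, 10, 20],
    [205, 215, 15, 25],
    [202, 212, 12, 22],
    [201, 211, 11, 21],
]

near = hamming_distance(average_hash(original), average_hash(rescan))
far = hamming_distance(average_hash(original), average_hash(different))
print(near, far)
```

A small distance suggests the same picture and a large one suggests different pictures. Where to draw the line between the two is a judgment call that has to be tuned per collection.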

For Boston residents and researchers who rely on city and university digital collections, the practical advice is straightforward. Before building any dataset using images drawn from data.boston.gov, the Leventhal Center, or any digitized archival collection, run a basic hash-comparison check. Free tools including ExifTool and ImageMagick can accomplish this on a personal laptop. Institutions that have not audited their collections since before 2023 should treat that as a gap, not a minor housekeeping item.
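For those who would rather not install anything, a basic check can also be sketched with Python's standard library alone. The version below is an illustrative alternative to the tools named above, not a description of how they work: it groups files by the SHA-256 digest of their bytes, so it catches exact copies only and will miss rescans or resized versions of the same item. The function names are hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file's contents, read in chunks to handle large scans."""
    hasher = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def find_exact_duplicates(root: Path) -> Dict[str, List[Path]]:
    """Map each digest shared by two or more files under root to those files."""
    groups: Dict[str, List[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            groups[file_digest(path)].append(path)
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}
```

Calling find_exact_duplicates(Path("my_images")) on a folder of downloads, where my_images is a placeholder name, returns every group of byte-identical files regardless of what they are called or which subfolder they sit in.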

The city's open-data team is expected to release updated data quality standards for image-based datasets later this year as part of the Wu administration's digital infrastructure roadmap. How rigorously those standards address deduplication will be worth watching when the document drops.

About this article

Published by The Daily Boston

This article was produced by The Daily Boston editorial desk and covers news in Boston. See our editorial standards for how we use AI.
