Boston's public and academic institutions are sitting on a sprawling, largely unaudited problem: tens of thousands of duplicate digital images spread across government servers, university repositories, and nonprofit archives, costing real money and undermining the integrity of public records. A review of data management practices across several city-linked organizations puts the scale of the issue in sharper focus ahead of a broader push toward digital transparency under Mayor Michelle Wu's open-data initiative.
The timing matters. Wu's administration has committed to expanding the city's open data portal — hosted at data.boston.gov — and data quality is front and center in that effort. Duplicate image files are not simply a storage nuisance. When the same photograph or scanned document appears under multiple file names or in multiple directories, it inflates dataset counts, skews search results, and can generate false matches in automated processing pipelines used by planners, researchers, and journalists alike.
The Scale of the Problem in Boston's Institutions
At Northeastern University's library system on Huntington Avenue, archivists have been working since early 2025 to audit digitized collections that span more than a century of New England history. Library staff found that roughly 18 percent of images ingested into one collection between 2019 and 2023 were duplicates or near-duplicates: identical or near-identical scans uploaded under different metadata tags, often because multiple staff members digitized the same physical item independently. That figure, shared in an internal review document that became part of a Digital Preservation Coalition working paper, is broadly consistent with rates seen in comparable academic digitization projects in the United States and the United Kingdom.
The Boston Public Library's Norman B. Leventhal Map & Education Center, located on Boylston Street in Copley Square, faced a similar reckoning when it expanded its online geospatial image collection in 2024. Automated deduplication tools flagged more than 2,400 image files as likely duplicates out of a collection that then numbered around 14,000 items, a duplication rate of roughly 17 percent. Storage for large TIFF files, the standard archival format, costs roughly $0.023 per gigabyte per month on commercial cloud platforms as of mid-2026, according to published pricing from Amazon Web Services. For institutions holding hundreds of thousands of high-resolution scans, redundant files can add up to thousands of dollars annually in unnecessary cloud spend.
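The arithmetic behind that estimate can be sketched in a few lines of Python. The per-gigabyte price and the 2,400-file and 17 percent figures come from the reporting above; the 200 MB average file size and the 500,000-scan archive are assumptions chosen only for illustration, and the function name is hypothetical.

```python
# Rough annual cost of keeping duplicate scans in cloud object storage.
# The price is the figure quoted in the article; the average file size
# is an assumed value, not a reported one.

PRICE_PER_GB_MONTH = 0.023  # USD per gigabyte per month


def annual_duplicate_cost(duplicate_files: int, avg_file_mb: float,
                          price_per_gb_month: float = PRICE_PER_GB_MONTH) -> float:
    """Return the yearly storage cost in USD for the redundant files alone."""
    total_gb = duplicate_files * avg_file_mb / 1000  # decimal gigabytes
    return total_gb * price_per_gb_month * 12


# About 2,400 flagged files, at an assumed 200 MB per TIFF.
leventhal = annual_duplicate_cost(2_400, 200)

# A hypothetical archive of 500,000 scans with a 17 percent duplication rate.
large = annual_duplicate_cost(round(500_000 * 0.17), 200)

print(f"2,400 duplicates:  ${leventhal:,.2f} per year")
print(f"85,000 duplicates: ${large:,.2f} per year")
```

Under these assumptions the smaller collection wastes a little over $130 a year, while the larger one approaches $4,700, which is where the "thousands of dollars" range begins. Larger files or multiple storage copies would push both numbers higher.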
The MBTA's public communications archive — used to store press photos, infrastructure images, and construction documentation — is another case in point. The Authority has been under scrutiny since the Federal Transit Administration's oversight period that began in 2022, and internal document management has been part of broader operational reform efforts. Duplicate image files in procurement and construction records can complicate audit trails, a concern that matters when federal reviewers are examining project documentation.
Detection Tools and What Comes Next
The technical solutions are well established. Perceptual hashing algorithms — software tools that generate a compact fingerprint for each image based on visual content rather than file name — can scan a collection of 100,000 images in under two hours on standard server hardware and flag duplicates with accuracy rates above 95 percent, according to published benchmarks from the open-source tool pHash. The harder work is human: deciding which copy to keep, reconciling metadata, and updating any links or citations pointing to the retired file.
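To make the fingerprint idea concrete, here is a minimal sketch of an "average hash," one of the simplest members of the perceptual-hash family. It is not the algorithm pHash itself uses, which is based on a discrete cosine transform. The toy 4x4 grids stand in for images that a real pipeline would first decode and shrink to a small grayscale thumbnail, and the distance threshold of 2 is an assumed value.

```python
# Simplified average hash over a small grayscale grid.
# Illustrative only: real tools decode and downscale each image first.

from typing import List

Grid = List[List[int]]  # rows of grayscale values, 0-255


def average_hash(pixels: Grid) -> int:
    """Pack one bit per pixel: 1 if the pixel is brighter than the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions where two hashes differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: Grid, b: Grid, max_distance: int = 2) -> bool:
    """Assumed threshold; tune it against a hand-checked sample."""
    return hamming_distance(average_hash(a), average_hash(b)) <= max_distance


# A scan, a slightly brighter rescan of the same item, and an unrelated image.
scan = [[10, 10, 200, 200],
        [10, 10, 200, 200],
        [200, 200, 10, 10],
        [200, 200, 10, 10]]
rescan = [[value + 8 for value in row] for row in scan]
other = [[200, 10, 200, 10],
         [10, 200, 10, 200],
         [200, 10, 200, 10],
         [10, 200, 10, 200]]

print(is_near_duplicate(scan, rescan))  # True: same pattern, different brightness
print(is_near_duplicate(scan, other))   # False: the hashes differ in 8 of 16 bits
```

Because the hash records only which pixels are brighter than average, a uniform brightness shift between two scans of the same item leaves it unchanged, while a file-level checksum of the two scans would differ completely.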
For Boston residents and researchers who rely on city and university digital collections, the practical advice is straightforward. Before building any dataset using images drawn from data.boston.gov, the Leventhal Center, or any digitized archival collection, run a basic hash-comparison check. Free tools including ExifTool and ImageMagick can accomplish this on a personal laptop. A plain file hash catches only byte-identical copies, so independent rescans of the same item still call for a perceptual comparison. Institutions whose most recent collection audit predates 2023 should treat that as a gap, not a minor housekeeping item.
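For readers who prefer a script, a basic hash-comparison check might look like the sketch below. It uses only Python's standard library rather than ExifTool or ImageMagick, and it groups files by the SHA-256 digest of their bytes. The function names and the temporary demo files are hypothetical. Note that this finds exact copies only: two files with the same pixels but different embedded metadata will not match.

```python
import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of the file's bytes, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(root: Path) -> Dict[str, List[Path]]:
    """Group files under root by digest; keep groups with more than one file."""
    groups: Dict[str, List[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            groups[file_digest(path)].append(path)
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}


# Demo on throwaway files: two byte-identical "scans" and one distinct file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "map_001.tif").write_bytes(b"same scan bytes")
    (root / "copy_of_map.tif").write_bytes(b"same scan bytes")
    (root / "map_002.tif").write_bytes(b"different scan bytes")
    duplicates = find_exact_duplicates(root)
    duplicate_names = [sorted(p.name for p in paths) for paths in duplicates.values()]

print(duplicate_names)  # [['copy_of_map.tif', 'map_001.tif']]
```

To check a real download, point `find_exact_duplicates` at the folder holding the images and review each group by hand before deleting anything.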
The city's open-data team is expected to release updated data quality standards for image-based datasets later this year as part of the Wu administration's digital infrastructure roadmap. How rigorously those standards address deduplication will be worth watching once the document is published.