Boston's public institutions are sitting on tens of thousands of redundant digital images — duplicated photographs, scanned documents, and archived visuals spread across servers at city agencies, libraries, and universities — and the effort to clean up those records is now drawing serious attention from technologists, archivists, and city administrators. The problem has become acute enough that the Mayor's Office of New Urban Mechanics, which oversees civic technology initiatives, has begun conversations with outside vendors about automated deduplication tools.
The issue matters now for a specific reason: Boston is midway through a multi-year digitization push launched under Mayor Michelle Wu's administration that aims to move city records, historical photographs, and planning documents into publicly accessible online repositories. When duplicate files flood those repositories, storage costs climb, search results degrade, and, critically, the public ends up with an unreliable picture of what the city actually holds.
What the Institutions Are Dealing With
The Boston Public Library's Digital Commonwealth platform, which hosts digitized collections from hundreds of Massachusetts cultural institutions, is one of the largest affected systems in the region. Archivists working with the BPL have described the challenge in public presentations at the Simmons University School of Library and Information Science on The Fenway: when multiple partner institutions contribute scans of the same historical photograph or document, the platform can end up with three or four near-identical image files indexed as separate records. Simmons faculty who specialize in digital preservation have argued in professional forums that the solution is not simply deleting files but building smarter ingest workflows that flag likely duplicates before they enter the system.
At Northeastern University's library on Huntington Avenue, staff managing the university's special collections have piloted a perceptual hashing approach — a technique that generates a compact numerical fingerprint for each image and compares it against existing records — to catch duplicates at the point of upload. The approach is gaining traction in academic library circles because it operates without requiring staff to manually review thousands of files.
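To illustrate the general idea of a perceptual fingerprint, here is a minimal sketch of an "average hash," one of several perceptual hashing variants. It is not Northeastern's implementation, which the article does not describe in detail. The function names, the 8x8 grid size, and the distance threshold of 5 bits are all assumptions chosen for illustration, and the image is represented as a plain grid of grayscale values rather than a decoded file.

```python
from typing import List, Sequence

Grid = Sequence[Sequence[int]]  # grayscale pixels (0-255), one inner sequence per row


def shrink(pixels: Grid, size: int = 8) -> List[List[float]]:
    """Downsample to size x size by averaging rectangular blocks of pixels."""
    height, width = len(pixels), len(pixels[0])
    if height < size or width < size:
        raise ValueError("image must be at least size x size pixels")
    out = []
    for r in range(size):
        r0, r1 = r * height // size, (r + 1) * height // size
        row = []
        for c in range(size):
            c0, c1 = c * width // size, (c + 1) * width // size
            block = [pixels[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def average_hash(pixels: Grid, size: int = 8) -> int:
    """Return a size*size-bit fingerprint: 1 where a cell is brighter than the mean."""
    cells = [v for row in shrink(pixels, size) for v in row]
    mean = sum(cells) / len(cells)
    bits = 0
    for v in cells:
        bits = (bits << 1) | (1 if v > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Count the bits that differ between two fingerprints."""
    return bin(a ^ b).count("1")


def is_likely_duplicate(a: Grid, b: Grid, max_distance: int = 5) -> bool:
    """Flag two images as likely duplicates (threshold is an illustrative assumption)."""
    return hamming(average_hash(a), average_hash(b)) <= max_distance
```

Because the fingerprint records only which regions are brighter than the image's own average, a slightly brightened or recompressed copy tends to produce the same or a nearby fingerprint, while an unrelated image usually differs in many bits. A production system would need to tune the threshold against its own collections.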
City agencies face a parallel but distinct version of the problem. The Boston Planning Department, which absorbed the former Boston Planning and Development Agency in a 2024 reorganization, maintains large internal photo archives documenting construction inspections, permit reviews, and neighborhood surveys across Jamaica Plain, Dorchester, and Roxbury. When staff rotate or projects change hands, the same site photograph frequently gets uploaded multiple times under different file names. That redundancy inflates storage costs — enterprise cloud storage for city government can run from roughly $0.02 to $0.05 per gigabyte per month, and archives can run into hundreds of terabytes — and complicates public records requests.
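To put the per-gigabyte rates in perspective, the arithmetic can be sketched as follows. The 200-terabyte archive size and the 10 percent duplicate share are hypothetical figures chosen only to make the calculation concrete; the article does not report either number for any Boston agency. The sketch also assumes decimal terabytes (1 TB = 1,000 GB).

```python
GB_PER_TB = 1000  # assumption: decimal terabytes


def monthly_cost(terabytes: float, price_per_gb: float) -> float:
    """Monthly storage cost in dollars for an archive of the given size."""
    return terabytes * GB_PER_TB * price_per_gb


def duplicate_cost_range(terabytes: float, duplicate_fraction: float,
                         low_price: float = 0.02, high_price: float = 0.05):
    """Low and high monthly cost attributable to duplicate files alone."""
    wasted_tb = terabytes * duplicate_fraction
    return monthly_cost(wasted_tb, low_price), monthly_cost(wasted_tb, high_price)


# Hypothetical inputs, not figures reported for any agency.
archive_tb = 200
duplicate_share = 0.10

low, high = duplicate_cost_range(archive_tb, duplicate_share)
print(f"Total: ${monthly_cost(archive_tb, 0.02):,.0f} to "
      f"${monthly_cost(archive_tb, 0.05):,.0f} per month")
print(f"Duplicates: ${low:,.0f} to ${high:,.0f} per month")
```

Under those assumptions, the archive as a whole would cost about $4,000 to $10,000 a month, and the duplicated portion would account for roughly $400 to $1,000 of that.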
What Comes Next, and What Experts Recommend
Technologists advising Boston-area institutions broadly agree on a few practical steps. First, any institution running a digitization program needs a deduplication audit before scaling up. Second, automated tools should be embedded into upload pipelines rather than applied retrospectively, because retrospective cleanup costs far more in staff time. Third, metadata standards need to be enforced consistently so that even when two slightly different scans of the same image exist for legitimate reasons, they can be linked rather than siloed.
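The second and third recommendations can be sketched as a check that runs at upload time and returns the earlier records a new file should be linked to. This is a minimal illustration, not any institution's actual pipeline: the `IngestIndex` class and its `ingest` method are hypothetical names, the index lives in memory, and the check matches only byte-for-byte identical files (for example, the same photograph re-uploaded under a different file name). Near-identical scans would need a perceptual comparison such as the hashing technique described earlier in the article.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class IngestIndex:
    """Hypothetical in-memory index keyed by the SHA-256 digest of file contents."""
    by_digest: Dict[str, List[str]] = field(default_factory=dict)

    def ingest(self, record_id: str, content: bytes) -> List[str]:
        """Register an upload and return IDs of earlier records with identical bytes.

        The new record is still stored, so legitimate copies can be linked
        to their counterparts rather than silently dropped.
        """
        digest = hashlib.sha256(content).hexdigest()
        earlier = list(self.by_digest.get(digest, []))
        self.by_digest.setdefault(digest, []).append(record_id)
        return earlier
```

An empty return value means the file is new to the index; a non-empty one gives staff, or a metadata rule, the record IDs to link or review before the upload is published.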
The Wu administration has not announced a formal citywide policy on duplicate image management as of July 4, 2026, but the Office of New Urban Mechanics has flagged digital asset governance as part of its broader open-data strategy, which was last updated in the spring. Advocates in the civic-tech community who follow Boston's data initiatives say the time to set standards is now, while the digitization program is still expanding.
For residents and researchers who use public digital archives — whether pulling historical Dorchester neighborhood photographs from Digital Commonwealth or requesting planning images through the city's public records portal — the practical advice is straightforward: if a search returns suspiciously similar results, report the duplication through the platform's feedback mechanism. Institutions say those user reports are among the most reliable early signals they have that a deduplication problem is growing.