Boston's public agencies and research institutions are sitting on millions of redundant digital images — duplicate photographs clogging servers, slowing retrieval systems, and quietly draining IT budgets — and a growing coalition of archivists, city officials, and university technologists says the problem has reached a point where it can no longer be ignored.
The issue has sharpened this summer after the Mayor's Office of New Urban Mechanics flagged digital-storage inefficiency as a line item in ongoing technology audits connected to the Wu administration's open-data initiative. The city's push toward greater digital transparency, which has included expanded public records portals for neighborhoods like Dorchester and Jamaica Plain, has exposed just how cluttered the underlying file systems have become. Duplicate images — the same photograph filed under multiple case numbers, departments, or date stamps — inflate storage costs and make accurate public records harder to search and verify.
What Officials and Experts Are Saying
At Northeastern University's library on Huntington Avenue, digital preservation specialists have been grappling with the same challenge in an academic context. Northeastern's Digital Scholarship Group, which manages large-scale digitization projects for Boston-area collections, has publicly discussed the problem of exact duplicates as a systemic flaw in how institutions ingest photographs at scale. These are files that look distinct to human cataloguers, because they carry different filenames or catalogue entries, but are byte-for-byte identical and therefore produce the same checksum. Librarians there have pointed to the Boston City Archives' Jamaica Plain neighborhood photo collection, portions of which were digitized between 2018 and 2022, as an example where duplicate detection tools were not applied consistently at the point of upload.
The Massachusetts Board of Library Commissioners, which provides state funding and standards guidance to public libraries across the Commonwealth, has in recent years encouraged institutions to adopt deduplication protocols as part of broader digital-infrastructure grants. Those grants, administered through the state's municipal libraries program, have funded technology upgrades at the Boston Public Library's Copley Square branch and at branch locations in Roxbury and East Boston. Without deduplication built into the intake workflow, archivists say, even well-funded digitization projects can end up consuming two or three times the storage their unique holdings actually require within a few years.
On the municipal side, Boston's Department of Innovation and Technology — which oversees the city's cloud infrastructure contracts — has been reviewing storage utilization across city departments as part of a broader IT modernization effort tied to the fiscal year 2026 budget cycle. Storage costs for municipal government cloud services have climbed steadily nationwide; according to a 2025 report by the National League of Cities, local governments across the U.S. collectively spend an estimated $2.4 billion annually on cloud and data-center infrastructure, with redundant file storage identified as one of the top controllable cost drivers.
Practical Steps Taking Shape
At MIT's campus in Kendall Square, researchers affiliated with the Computer Science and Artificial Intelligence Laboratory have been developing open-source tools designed to flag near-duplicate images — photographs that are not byte-identical but are visually redundant — within large institutional databases. That work, which has been presented at archival-technology conferences, is drawing interest from Boston-area cultural institutions including the Massachusetts Historical Society on Boylston Street and the Bostonian Society, which manages collections at the Old State House.
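For readers curious what near-duplicate detection can look like in practice, the sketch below uses average hashing, a simple and widely known perceptual-hash technique. It is not the CSAIL tooling described above, whose internals the reporting does not detail. The sketch assumes each image has already been decoded into a grid of grayscale values, and the function names are illustrative.

```python
from typing import List, Sequence

Gray = Sequence[Sequence[float]]  # rows of grayscale values (assumed already decoded)


def shrink(pixels: Gray, size: int = 8) -> List[List[float]]:
    """Downsample a grayscale grid to size x size by averaging blocks."""
    h, w = len(pixels), len(pixels[0])
    out = []
    for i in range(size):
        r0 = i * h // size
        r1 = max((i + 1) * h // size, r0 + 1)
        row = []
        for j in range(size):
            c0 = j * w // size
            c1 = max((j + 1) * w // size, c0 + 1)
            block = [pixels[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def average_hash(pixels: Gray, size: int = 8) -> int:
    """Build a size*size-bit fingerprint: one bit per cell, set if brighter than the mean."""
    flat = [v for row in shrink(pixels, size) for v in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for v in flat:
        bits = (bits << 1) | (1 if v > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Count the bits that differ between two fingerprints."""
    return bin(a ^ b).count("1")
```

Two images whose fingerprints differ by only a few bits are candidates for human review. The cutoff is a tuning choice that depends on the collection, so no specific threshold is suggested here.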
For city residents and community groups who rely on public records portals to track development projects in neighborhoods like the South End and Allston, the practical stakes are real. Duplicate image records can appear as multiple distinct entries in search results, creating confusion about permit histories or property inspections. Archivists recommend that any institution planning a new digitization project build automated deduplication into its workflow before the first file is uploaded, using at minimum a cryptographic hash of each file's contents, rather than attempting a cleanup retroactively. Retroactive deduplication on a large existing archive can take months and carries a risk of accidental deletion if not carefully supervised.
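A minimal sketch of hash-based deduplication at intake follows. It assumes files arrive on a local disk and that an in-memory index is enough for illustration. The `IntakeIndex` class and its `register` method are hypothetical names, not part of any system named in this article. SHA-256 comes from Python's standard `hashlib` module.

```python
import hashlib
from pathlib import Path
from typing import Dict, Optional


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


class IntakeIndex:
    """Hypothetical index mapping each content digest to the first file seen with it."""

    def __init__(self) -> None:
        self._seen: Dict[str, Path] = {}

    def register(self, path: Path) -> Optional[Path]:
        """Record a file; return the earlier path if its bytes were already ingested."""
        digest = file_digest(path)
        original = self._seen.get(digest)
        if original is None:
            self._seen[digest] = path
        return original
```

The index only reports a match and never deletes anything. This leaves the decision about what to do with a duplicate to a person, in keeping with the archivists' caution about accidental deletion.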
The Mayor's Office of New Urban Mechanics is expected to release updated guidelines for city departments on digital-file management before the end of the third quarter of 2026. Archivists and technologists who have been consulted in that process say the guidance will likely establish baseline standards for duplicate detection across agencies — a modest but concrete step toward bringing Boston's digital housekeeping in line with what its ambitions for open government actually require.