Boston's public institutions are sitting on a problem measured in terabytes. Across municipal databases, university digital archives, and the Massachusetts Bay Transportation Authority's document management systems, duplicate image files — photographs, scanned records, design renderings — have accumulated for years without systematic removal. The financial and operational cost of that neglect is now becoming harder to ignore.
The issue surfaces at a moment when the city's technology infrastructure is under pressure. Mayor Michelle Wu's administration has pushed digitization of city services as part of a broader modernization agenda, moving permitting, inspections, and community-engagement records online. That migration has accelerated the volume of files entering city servers — and, according to digital asset management specialists, tends to multiply duplicate images at roughly three to four times the rate of manual filing systems, because automated uploads rarely include deduplication checks at the point of entry.
What the Data Actually Looks Like
Industry benchmarks published by the Storage Networking Industry Association put the share of redundant files in unmanaged enterprise archives at between 25 and 40 percent. Applied to a mid-sized municipal government like Boston's, that range suggests a substantial share, potentially a quarter or more, of what the city spends annually on cloud and on-premises storage is funding copies of files already in the system. The city's Department of Innovation and Technology oversees digital infrastructure across dozens of departments, many of them housed in City Hall, but a unified deduplication audit has not been publicly reported as completed.
The problem is not unique to government. Northeastern University's library system, which maintains digital collections at its Snell Library on Huntington Avenue, and Digital Commonwealth, a statewide repository of digitized historical materials hosted by the Boston Public Library, both operate archives where duplicate image ingestion is a documented challenge. Digital Commonwealth, which hosts more than 1.6 million items from Massachusetts cultural institutions, implemented automated similarity-detection tools starting in 2022, but administrators have acknowledged publicly that legacy collections uploaded before that year remain largely unaudited.
The MBTA faces a parallel issue in its engineering and infrastructure documentation. The authority holds decades of track diagrams, station photographs, and construction records digitized from paper originals. When the Green Line Extension opened in 2022, with the Union Square branch in Somerville in March and the Medford branch in December, post-project documentation uploads were reported internally to have generated significant file redundancy, a common outcome when multiple contractors submit overlapping progress-photo sets to a shared repository.
The Cost of Doing Nothing
Storage is not free. AWS S3 cloud storage, the type commonly used by Massachusetts state agencies under the state's MassIT procurement framework, runs at roughly $0.023 per gigabyte per month at standard rates as of mid-2026. For an archive holding 500 terabytes with a 30 percent duplication rate, about 150 terabytes is redundant. At that rate, the redundant data costs about $3,450 a month, so eliminating it would represent potential savings of roughly $41,400 annually on storage alone, before accounting for reduced backup times and faster search performance.
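The arithmetic behind that estimate can be reproduced in a few lines. The sketch below uses only the figures quoted above and makes two simplifying assumptions that are not in the original reporting: a flat per-gigabyte rate with no volume tiers or request fees, and decimal terabytes (1 TB = 1,000 GB).

```python
# Back-of-the-envelope storage savings from the figures quoted in the article.
PRICE_PER_GB_MONTH = 0.023  # USD, standard rate cited above
ARCHIVE_TB = 500            # archive size in terabytes
DUPLICATION_RATE = 0.30     # share of the archive that is redundant
GB_PER_TB = 1_000           # assumption: decimal terabytes


def annual_savings(archive_tb, duplication_rate, price_per_gb_month,
                   gb_per_tb=GB_PER_TB):
    """Yearly cost of storing the redundant share at a flat monthly rate."""
    redundant_gb = archive_tb * gb_per_tb * duplication_rate
    return redundant_gb * price_per_gb_month * 12


savings = annual_savings(ARCHIVE_TB, DUPLICATION_RATE, PRICE_PER_GB_MONTH)
print(f"Estimated annual savings: ${savings:,.0f}")
```

Changing the duplication rate to the 25 or 40 percent bounds cited earlier gives a sense of how sensitive the estimate is to that single input.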
Greater Boston's biotech corridors, along Longwood Avenue in Boston and Binney Street in Cambridge, generate their own version of the same problem. Research institutions, including those affiliated with Harvard Medical School and the Broad Institute, maintain imaging datasets such as microscopy photographs and clinical scan exports, where duplication rates in unmanaged systems have been measured at 20 to 35 percent in peer-reviewed data management studies published in journals like Scientific Data.
The practical remedies are well-established. Perceptual hashing algorithms can identify visually identical or near-identical images even when file names differ. Tools built on that approach, including open-source options like imagededup, can process tens of thousands of files per hour on standard server hardware. The obstacle in most institutional settings is not technology. It is the organizational decision to schedule and fund an audit in the first place.
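To make the idea of perceptual hashing concrete, here is a minimal sketch of one common variant, the average hash. It is not the method of any particular tool named above. It assumes the images have already been decoded, converted to grayscale, and shrunk to the same small grid, a step real tools handle with an imaging library. The function names and the distance threshold are illustrative choices, not an established API.

```python
# Minimal average-hash sketch on tiny grayscale grids (lists of pixel rows).
# Assumption: inputs are already grayscale and downscaled to the same size.

def average_hash(pixels):
    """Return a bit string: '1' where a pixel is brighter than the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return "".join("1" if value > mean else "0" for value in flat)


def hamming_distance(hash_a, hash_b):
    """Count the positions where two equal-length hashes differ."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return sum(a != b for a, b in zip(hash_a, hash_b))


def is_near_duplicate(pixels_a, pixels_b, max_distance=2):
    # max_distance is a hypothetical threshold; it would need tuning on real data.
    distance = hamming_distance(average_hash(pixels_a), average_hash(pixels_b))
    return distance <= max_distance


# Toy 4x4 "images": an original, a slightly brightened copy, an unrelated pattern.
original = [
    [10, 20, 200, 210],
    [15, 25, 205, 215],
    [12, 22, 202, 212],
    [18, 28, 208, 218],
]
brightened = [[min(255, value + 5) for value in row] for row in original]
unrelated = [
    [200, 10, 200, 10],
    [10, 200, 10, 200],
    [200, 10, 200, 10],
    [10, 200, 10, 200],
]

print(is_near_duplicate(original, brightened))  # True: same light/dark layout
print(is_near_duplicate(original, unrelated))   # False: layout differs
```

Because the hash records only which pixels are lighter or darker than the image's own average, a uniformly brightened or re-saved copy lands on the same bit pattern even though its bytes and file name differ.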
For Boston's city departments, that decision sits with the Department of Innovation and Technology in coordination with each department's records manager. Institutions running their own archives, including the BPL, Northeastern, and the MBTA, will each need to build deduplication into standard ingestion workflows rather than treating it as a one-time cleanup project. The longer the audit is deferred, the larger the redundant archive grows, and the more expensive the eventual remediation becomes.
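As an illustration of what a deduplication check at the point of entry might look like, the sketch below fingerprints each incoming file with SHA-256 and reports when identical content is already on record. The `IngestIndex` class and the file names are hypothetical. A production system would keep the index in a database rather than in memory, and would pair this exact-match check with perceptual hashing to catch near-duplicates.

```python
import hashlib


class IngestIndex:
    """Hypothetical in-memory index of content digests seen so far."""

    def __init__(self):
        self._seen = {}  # digest -> name of the first file with that content

    def ingest(self, name, data):
        """Return the name of an existing identical file, or None if new."""
        digest = hashlib.sha256(data).hexdigest()
        existing = self._seen.get(digest)
        if existing is None:
            self._seen[digest] = name  # first time this content is seen
        return existing


index = IngestIndex()
print(index.ingest("station_photo.jpg", b"example image bytes"))       # None
print(index.ingest("station_photo_copy.jpg", b"example image bytes"))  # station_photo.jpg
print(index.ingest("track_diagram.png", b"different bytes"))           # None
```

Checking at upload time keeps the archive from growing with byte-identical copies, which is cheaper than finding and removing them in a later audit.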