Boston's public institutions are sitting on a growing pile of digital clutter — and the people responsible for managing it want the city to do something about it before the problem gets worse. Duplicate images, defined broadly as identical or near-identical digital files stored multiple times across separate systems, have quietly ballooned inside municipal databases, university archives, and the city's sprawling network of public-facing websites, according to administrators and digital asset specialists working in the sector.
The issue surfaced publicly this spring when the Boston Public Library's digital collections team flagged redundancy problems inside the Leventhal Map and Education Center archive on Boylston Street. Staff there discovered that a significant share of scanned historical maps had been ingested multiple times into the library's content management system — consuming server storage, slowing retrieval times, and complicating metadata tagging. The library has not published a full audit, but the problem is not unique to Copley Square.
Why It Matters Now
Budget pressures are making redundant data expensive rather than merely inconvenient. Cloud storage costs for municipal governments have climbed sharply since 2023, with some public sector contracts running upward of $0.023 per gigabyte per month on standard tiers — a figure that multiplies fast when thousands of duplicate files sit undetected. The City of Boston's Department of Innovation and Technology, which oversees the city's digital infrastructure from its offices near City Hall Plaza, has been working through a broader data governance initiative started under Mayor Michelle Wu's administration. Digital redundancy is now part of that conversation.
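To make the arithmetic concrete, here is a rough back-of-the-envelope sketch. Only the $0.023 rate comes from the figure cited above; the file count and average file size are hypothetical placeholders, not numbers from any Boston institution, and the function name is invented for illustration.

```python
# Back-of-the-envelope estimate of what redundant copies cost to store.
# ASSUMPTION: the file count and average size below are illustrative only.
PRICE_PER_GB_MONTH = 0.023  # USD, the standard-tier rate cited in the article


def monthly_duplicate_cost(duplicate_files: int, avg_size_mb: float,
                           price_per_gb_month: float = PRICE_PER_GB_MONTH) -> float:
    """Return the monthly storage cost, in USD, of the redundant copies."""
    total_gb = duplicate_files * avg_size_mb / 1024  # convert MB to GB
    return total_gb * price_per_gb_month


if __name__ == "__main__":
    # Hypothetical archive: 50,000 duplicate scans averaging 200 MB each.
    monthly = monthly_duplicate_cost(duplicate_files=50_000, avg_size_mb=200)
    print(f"Monthly: ${monthly:,.2f}   Yearly: ${monthly * 12:,.2f}")
```

The per-gigabyte rate looks negligible on its own. The total depends entirely on how many redundant copies exist and how large they are, which is why an audit has to come before any cost estimate.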
Digital asset management specialists say the core challenge is systemic rather than accidental. Most large institutions — hospitals, universities, city agencies — operate multiple software platforms that don't communicate well with each other. Images get uploaded to a website content management system, a records platform, a shared drive, and an email archive independently, with no automated deduplication running across all four. At Northeastern University, which manages digital assets for dozens of colleges and research centers along Huntington Avenue, IT administrators have described the problem internally as a question of workflow design rather than storage capacity.
At MIT, whose Cambridge campus faces Boston across the Charles River, the Libraries division has published guidance on deduplication best practices as part of its digital preservation standards, pointing toward hash-based file comparison tools as the most reliable detection method. A conventional file hash matches only byte-identical copies. Perceptual hashing, which identifies visually similar rather than byte-identical images, is widely regarded as the better fit for photographic archives where the same image may have been saved at different resolutions or cropped slightly differently.
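For readers curious how a perceptual hash differs from an ordinary file hash, the sketch below implements "average hash," one of the simplest perceptual hashing schemes. It is an illustration of the general idea, not the method named in MIT's guidance or used by any institution mentioned here. The function names are hypothetical, and the images are assumed to be already decoded into grids of grayscale values at least 8 pixels on each side; a real workflow would lean on an imaging library for that step.

```python
from typing import List

Grid = List[List[int]]  # grayscale pixel values (0-255), one inner list per row


def shrink(pixels: Grid, size: int = 8) -> List[List[float]]:
    """Downsample to size x size by averaging rectangular blocks.

    ASSUMPTION: the image is at least `size` pixels in each dimension.
    """
    height, width = len(pixels), len(pixels[0])
    small = []
    for r in range(size):
        r0, r1 = r * height // size, (r + 1) * height // size
        row = []
        for c in range(size):
            c0, c1 = c * width // size, (c + 1) * width // size
            block = [pixels[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            row.append(sum(block) / len(block))
        small.append(row)
    return small


def average_hash(pixels: Grid, size: int = 8) -> int:
    """Return a size*size-bit fingerprint: 1 where a cell is brighter than the mean."""
    flat = [value for row in shrink(pixels, size) for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Count the bits that differ between two fingerprints."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    # The same left-to-right gradient stored at two resolutions.
    small_copy = [[x * 16 for x in range(16)] for _ in range(16)]
    large_copy = [[(x // 2) * 16 for x in range(32)] for _ in range(32)]
    print("bits differing:", hamming(average_hash(small_copy), average_hash(large_copy)))
```

Two files with different byte contents can still produce fingerprints that differ in only a few bits, which is what lets this approach catch resized copies. How many differing bits should count as "the same image" is a judgment call that each archive would need to tune on its own material.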
What Officials and Specialists Are Recommending
The Wu administration has not announced a citywide deduplication mandate, but the Department of Innovation and Technology has been reviewing the city's data governance framework through the first half of 2026. Officials working within that process have indicated — without committing to a timeline — that any updated framework would address image and file redundancy as part of broader cloud cost management.
Outside City Hall, the consensus among archivists and technologists is fairly consistent. The Massachusetts Board of Library Commissioners, which funds digitization projects at public libraries across the state, advises institutions receiving grant money to build deduplication checks into their ingest workflows before files ever reach long-term storage. That guidance reaches well beyond the BPL's main Copley Square location, covering dozens of smaller library systems as well as BPL neighborhood branches in Jamaica Plain and Dorchester that have been digitizing local community records over the past three years.
For institutions still working through backlogged archives, specialists recommend starting with a file-level audit using open-source tools before committing to expensive enterprise solutions. Prioritizing high-traffic image collections, those embedded in public websites or frequently requested through records portals, typically yields the fastest return on cleanup effort.
The practical upshot for Boston's public institutions is straightforward: the longer duplicate files accumulate, the more expensive and complicated the eventual cleanup becomes. The conversation has started. The hard work of actually running the audits is still ahead.
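What a first-pass, file-level audit might look like in practice: the sketch below walks a folder, computes a SHA-256 digest for every file using only Python's standard library, and reports groups of byte-identical copies. It is a minimal illustration, not a tool any of the institutions above is known to use, and the function names are hypothetical. It finds exact duplicates only; resized or cropped copies would need a perceptual comparison instead.

```python
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MB chunks so large scans never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root: Path) -> Dict[str, List[Path]]:
    """Map each digest to its files, keeping only digests shared by 2+ files."""
    groups: Dict[str, List[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            groups[sha256_of(path)].append(path)
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}


def report(root: Path) -> None:
    """Print each group of identical files found under `root`."""
    for digest, paths in find_duplicates(root).items():
        print(f"{len(paths)} identical copies ({digest[:12]}...):")
        for path in paths:
            print(f"  {path}")
```

Calling `report(Path("scans"))`, where `scans` stands in for whatever folder holds the collection, lists every set of identical files. Pointing it first at the folders behind public websites or records portals follows the prioritization specialists describe.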