Boston's civic and academic institutions spent the first week of July wrestling with a persistent digital housekeeping problem: duplicate images clogging archives, slowing public-facing websites, and complicating records requests at a moment when the city is pushing hard on transparency and digital accessibility. The issue surfaced publicly this week as staff at several Boston Public Library branches flagged backlogs in their digitization queue tied directly to redundant file storage.
The timing matters. Mayor Michelle Wu's administration has made open data and digital service delivery central planks of its second-term agenda, and the city's Information and Technology Department has been under pressure to modernize legacy systems across multiple departments. Duplicate image files (the same photograph stored under multiple file names, or the same scan uploaded in different resolutions without a clear master record) are a low-glamour problem that compounds quickly as an institution's digital holdings grow. Boston's holdings are substantial: tens of millions of records across municipal, educational, and cultural institutions.
Where the Problem Is Showing Up
The Boston Public Library's Copley Square central branch has been running a phased digitization project for its Maps and Photographs collection for several years. Staff there found this spring that a meaningful share of newly scanned items were being flagged as probable duplicates of existing entries, slowing the catalog update pipeline and forcing manual review. The BPL has not released specific figures on the scope of the backlog, but the problem is consistent with patterns seen at peer institutions managing large historical photograph collections.
Across town in Roxbury, the City Archives at the Strand Theatre complex on Dorchester Avenue has faced similar friction. Departments submitting records for long-term storage have at times uploaded multiple versions of the same photograph — different crops, different exposure adjustments — without standardized naming conventions to distinguish them. The result is storage redundancy and, in at least some cases, conflicting metadata attached to the same underlying image.
Boston's university corridor is not immune. Northeastern University's digital library services team, which manages a substantial collection of neighborhood documentation from the South End and Lower Roxbury going back to the 1960s urban renewal period, has been piloting automated deduplication software since January 2026. The pilot covers roughly 40,000 image files in one test collection. Results from the first phase of the pilot have not yet been made public.
What It Costs — and What Comes Next
Storage costs are a concrete driver of urgency. Cloud storage pricing for large institutional archives has risen since 2022, and duplicate files mean institutions pay to store the same data twice. Industry benchmarks suggest that duplicate and redundant files can account for 20 to 30 percent of total storage in poorly managed digital archives — though the specific figure varies widely by institution and workflow history. For a mid-size city archive operating on a fixed budget, that overhead is not trivial.
The practical fix is more mundane than the problem sounds. Archivists are using perceptual hashing tools, software that generates a fingerprint for each image based on its visual content rather than its file name, to surface probable duplicates for human review. The human step remains essential because two visually similar photographs of, say, Faneuil Hall taken thirty years apart are not duplicates; they are distinct historical records. Automating the initial sort while keeping trained staff in the loop is the current best practice.
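For readers curious how a visual fingerprint works, here is a minimal sketch of one simple perceptual hashing variant, often called an average hash. It is an illustration only: the institutions named in this story have not said which tools or settings they use. The function names, the representation of an image as rows of grayscale values, and the five-bit threshold are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

Image = List[List[int]]  # grayscale pixels, 0-255, stored row by row (assumed format)


def average_hash(pixels: Image, hash_size: int = 8) -> int:
    """Return a (hash_size * hash_size)-bit fingerprint of a grayscale image."""
    height, width = len(pixels), len(pixels[0])
    if height < hash_size or width < hash_size:
        raise ValueError("image is smaller than the hash grid")
    # Shrink the image to a hash_size x hash_size grid by averaging each block.
    cells = []
    for r in range(hash_size):
        r0, r1 = r * height // hash_size, (r + 1) * height // hash_size
        for c in range(hash_size):
            c0, c1 = c * width // hash_size, (c + 1) * width // hash_size
            block = [pixels[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    # One bit per grid cell: 1 if the cell is brighter than the overall mean.
    bits = 0
    for value in cells:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Count the bits that differ between two fingerprints."""
    return bin(a ^ b).count("1")


def probable_duplicates(
    images: Dict[str, Image], max_distance: int = 5
) -> List[Tuple[str, str]]:
    """Return name pairs whose fingerprints differ by at most max_distance bits.

    The threshold is illustrative; flagged pairs are candidates for human
    review, not confirmed duplicates.
    """
    hashes = {name: average_hash(px) for name, px in images.items()}
    names = sorted(hashes)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if hamming_distance(hashes[first], hashes[second]) <= max_distance:
                pairs.append((first, second))
    return pairs
```

Because the fingerprint depends on the pattern of light and dark rather than on exact pixel values, a scan re-uploaded with a small exposure adjustment lands on or near the same hash as the original. The same property explains why review by trained staff still matters: two different photographs with a similar composition can also land close together.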
Residents who use the BPL's digital collections, or who submit public records requests for city files that include photographs (building permits, code enforcement files, police records), may feel the effect directly: some requests may take longer than the standard ten-business-day window while staff work through the queue. The city's public records office has not announced any formal extension of processing timelines.
The longer-term picture is that institutions that invest in deduplication now will be better positioned as the city pushes more services online. The Wu administration's digital equity initiative, which has extended broadband access to parts of Dorchester and East Boston, is also driving higher demand for accessible digital archives. More users mean more pressure on backend systems to work cleanly. Getting image libraries in order is unglamorous, but the work is underway.