Boston's public libraries, city agencies, and university digital archives collectively hold hundreds of thousands of duplicate image files — redundant scans, mirror copies, and re-uploaded photographs that consume server storage, slow retrieval systems, and cost taxpayers and institutions real money every fiscal year. The problem is structural, persistent, and, according to digital archivists working at institutions along the Route 128 corridor, getting worse as digitization projects accelerate without coordinated deduplication standards in place.
The timing matters. Mayor Michelle Wu's administration has pushed hard on government transparency and digital-services modernization since 2022, and several city departments have been migrating records to cloud-based platforms through the Office of Digital Equity and Technology. But migration without deduplication is like moving a cluttered apartment box by box — the mess travels with you.
The Scale of the Problem in Boston's Institutions
At the Boston Public Library's Digitization Services unit on Boylston Street, staff manage a collection that has grown to more than 2.4 million digitized items, a figure the BPL has cited publicly in its annual reports. Industry-standard estimates from digital preservation organizations suggest that between 15 and 30 percent of files in large digitized collections are functionally redundant — meaning the BPL could theoretically be storing somewhere between 360,000 and 720,000 duplicate or near-duplicate image files, though the library has not publicly released its own deduplication audit figures.
Northeastern University's Digital Scholarship Group, based on the Huntington Avenue campus, has grappled with similar issues in its archives. University digital storage is not free: enterprise-grade cloud storage through platforms commonly used by academic institutions runs roughly $20 to $23 per terabyte per month for managed archival tiers. In a 500-terabyte collection where 20 percent of the data is duplicated, about 100 terabytes are redundant, which translates to roughly $2,000 to $2,300 in wasted spending every month, before accounting for staff time spent tagging, cataloguing, and retrieving files that are effectively identical to ones already in the system.
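The storage arithmetic above can be checked in a few lines. This is a back-of-the-envelope sketch only: the function name is made up for illustration, and the inputs are the figures quoted in this article, not audited numbers from any institution.

```python
def duplicate_storage_cost(total_tb, duplicate_fraction, price_per_tb_month):
    """Monthly cost of storing the redundant share of a collection.

    Hypothetical helper: multiplies collection size by the share that is
    duplicated and by the per-terabyte monthly price.
    """
    return total_tb * duplicate_fraction * price_per_tb_month


# Figures quoted in the article: 500 TB, 20 percent duplicates,
# $20 to $23 per terabyte per month.
low = duplicate_storage_cost(500, 0.20, 20.0)
high = duplicate_storage_cost(500, 0.20, 23.0)

print(f"Monthly: ${low:,.0f} to ${high:,.0f}")
print(f"Yearly:  ${low * 12:,.0f} to ${high * 12:,.0f}")
```

Run over twelve months, the same assumptions put the waste at roughly $24,000 to $27,600 a year for storage alone.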
The City of Boston's own records management has also drawn scrutiny. The city's Archive and Records Management division handles tens of thousands of digitized permits, inspection photographs, and planning documents generated each year by departments including the Boston Planning Department and the Inspectional Services Department. With the acceleration of housing-production reviews in Jamaica Plain and Dorchester — two neighborhoods where the Wu administration has prioritized zoning reform and new construction approvals — the volume of uploaded site photographs alone has climbed sharply since 2023.
Why Deduplication Is Harder Than It Sounds
Identifying duplicate images is not as simple as matching file names. Two photographs of the same Dorchester triple-decker, shot seconds apart on the same device, will have different file sizes, timestamps, and metadata, so tools that compare files byte for byte treat them as unrelated. Perceptual hashing algorithms, which compare visual content rather than raw file data, can catch near-duplicates, but deploying them across legacy database systems requires IT investment and staff retraining that many institutions have deferred.
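To make the distinction concrete, here is a minimal sketch of one simple perceptual technique, an "average hash," set beside an ordinary SHA-256 checksum. It is an illustration under stated assumptions, not the method any Boston institution uses: the tiny 4x4 grayscale grids stand in for images that a real tool would first decode and shrink, and all names are hypothetical.

```python
import hashlib


def exact_digest(pixels):
    """SHA-256 of the raw pixel values: changes if any single value changes."""
    flat = [v for row in pixels for v in row]
    return hashlib.sha256(bytes(flat)).hexdigest()


def average_hash(pixels):
    """One bit per pixel: 1 if the pixel is brighter than the image mean."""
    flat = [v for row in pixels for v in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for v in flat:
        bits = (bits << 1) | (1 if v > mean else 0)
    return bits


def hamming_distance(a, b):
    """Number of bit positions where two hashes differ."""
    return bin(a ^ b).count("1")


# Hypothetical thumbnails: dark on the left, bright on the right.
shot_a = [
    [10, 20, 200, 210],
    [12, 22, 205, 215],
    [11, 19, 198, 220],
    [9, 25, 202, 212],
]
# Same scene, slightly brighter exposure (every pixel +3).
shot_b = [[v + 3 for v in row] for row in shot_a]
# A different picture: the mirror image of shot_a.
other = [row[::-1] for row in shot_a]

print(exact_digest(shot_a) == exact_digest(shot_b))                # False
print(hamming_distance(average_hash(shot_a), average_hash(shot_b)))  # 0
print(hamming_distance(average_hash(shot_a), average_hash(other)))   # 16
```

The checksum sees two unrelated files, while the average hash sees the two exposures as identical and the mirrored image as completely different. Production tools typically use more robust variants and a distance threshold rather than an exact match.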
The Massachusetts Board of Library Commissioners has funded digitization grants totaling several million dollars over the past five years to help smaller public libraries along the MBTA commuter rail network bring physical collections online. Those grants have generally not required recipient institutions to adopt specific deduplication protocols before or after upload — a gap that digital preservation professionals have flagged in public forums, including at the 2025 New England Archivists spring meeting held in Boston.
For city agencies, the practical path forward involves three steps that digital records managers have outlined in guidance documents: conducting a baseline storage audit to quantify redundancy, selecting a perceptual hashing or content-aware deduplication tool compatible with existing database infrastructure, and establishing an upload protocol that runs automated duplicate checks before a file is committed to the archive. Some institutions in comparable dense urban environments have reported storage reductions of 18 to 22 percent after a single systematic cleanup pass; Chicago's city archive, for instance, completed a deduplication pass in 2024.
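The first and third of those steps, the baseline audit and the upload check, can be sketched in a few lines. This is a simplified illustration, not a description of any agency's system: the `Archive` class, the `audit` function, and the file names are all hypothetical, the archive lives in memory, and the check catches only byte-identical copies. Catching near-duplicates would require a perceptual or content-aware comparison in place of the checksum.

```python
import hashlib


def audit(files):
    """Baseline audit: fraction of files that are byte-identical repeats.

    `files` maps a file name to its contents as bytes.
    """
    seen = set()
    duplicates = 0
    for data in files.values():
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen:
            duplicates += 1
        else:
            seen.add(digest)
    return duplicates / len(files) if files else 0.0


class Archive:
    """Toy in-memory archive that refuses exact duplicates at upload time."""

    def __init__(self):
        self._by_digest = {}  # checksum -> name of the first file stored
        self.rejected = []    # (rejected name, name of existing copy)

    def commit(self, name, data):
        """Store the file unless identical content is already held."""
        digest = hashlib.sha256(data).hexdigest()
        if digest in self._by_digest:
            self.rejected.append((name, self._by_digest[digest]))
            return False
        self._by_digest[digest] = name
        return True


# Hypothetical upload batch: five files, two of them exact repeats.
batch = {
    "permit_0412.tif": b"scan-1",
    "permit_0412_copy.tif": b"scan-1",
    "site_photo_a.jpg": b"scan-2",
    "site_photo_b.jpg": b"scan-3",
    "site_photo_b_reupload.jpg": b"scan-3",
}

print(f"Redundant share before cleanup: {audit(batch):.0%}")

archive = Archive()
results = {name: archive.commit(name, data) for name, data in batch.items()}
print(archive.rejected)
```

Running the check at upload time means the duplicates never reach storage, which is cheaper than finding them in a later cleanup.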
For Boston, a city spending heavily on digital infrastructure and pushing housing and transit data into public-facing platforms, getting the underlying data hygiene right is not optional. The duplicate-image problem is unglamorous. It is also, dollar for dollar, worth fixing.