Boston's public institutions are sitting on a problem hiding in plain sight: thousands of duplicate image files scattered across municipal servers, university libraries, and civic archives, with a storage bill that keeps growing. Librarians, technologists, and city administrators have started pushing harder this summer for a coordinated approach to identifying and removing redundant files, arguing that sprawling digital collections have outgrown the patchwork systems managing them.
The issue has gained urgency because several major Boston institutions are mid-cycle on digital infrastructure contracts. The Boston Public Library's central branch on Boylston Street, which manages one of the largest municipally held photograph collections in New England, is in the middle of a multi-year digitization initiative. Northeastern University's Digital Repository Service, based at its Snell Library on the Huntington Avenue campus, is similarly expanding its holdings. Both operations have flagged duplicate image accumulation as a cost and access problem during internal reviews, according to publicly available project documentation.
What's Driving the Redundancy Problem
Digital collections grow fast and rarely in a straight line. Institutions scan the same physical photographs multiple times as technology improves. Staff upload files without checking whether a version already exists. Mergers between departmental archives — common in universities reorganizing after the pandemic — compound the mess. The result is servers holding three, four, or five copies of the same image at different resolutions, with inconsistent metadata attached to each.
Archivists and information scientists have a technical term for the fix: deduplication. The process uses hash-matching algorithms to identify files that are byte-for-byte identical or perceptually similar, then flags them for human review before anything is deleted. The human-in-the-loop step matters because two images that look identical to a machine can have different provenance records — one might carry a signed usage rights agreement that the other lacks.
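The exact-match half of that process can be sketched in a few lines of Python. This is a minimal illustration rather than any institution's actual workflow: the function names `sha256_of` and `find_exact_duplicates` are hypothetical, SHA-256 is an assumed choice of hash, and the sketch only groups byte-identical files for review. Perceptually similar images would need a separate image-aware comparison.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large scans need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(root: Path) -> list[list[Path]]:
    """Return groups of files under root whose contents are byte-identical."""
    by_hash: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            by_hash[sha256_of(path)].append(path)
    # Only groups with more than one member are duplicates.
    # Nothing is deleted here; the groups are meant for human review.
    return [group for group in by_hash.values() if len(group) > 1]
```

Each returned group is a list of paths a curator could check against provenance and rights records before deciding which copy to keep.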
The Massachusetts Board of Library Commissioners, which administers state grants to public libraries, has included digital collection integrity as a criterion in its Competitive Grants program since at least fiscal year 2024. Libraries applying for technology funding are now expected to demonstrate they have deduplication protocols or a plan to adopt them. That requirement has filtered down to smaller branches, including the Codman Square branch in Dorchester and the Jamaica Plain branch on South Street, both of which have received state technology support in recent grant cycles.
The Cost Argument Is Starting to Land
Storage is not free. Cloud hosting for large image libraries runs institutions roughly $23 to $35 per terabyte per month on commercial platforms, depending on access frequency and redundancy tiers — figures that add up quickly when a single archive holds hundreds of thousands of high-resolution scans. For budget-pressured city departments still operating under Mayor Michelle Wu's constrained fiscal 2026 capital plan, that math has become harder to ignore.
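A rough calculation shows how those rates compound. The sketch below uses the $23 and $35 per-terabyte figures cited above; the collection size (300,000 unique images), the average file size (100 MB), and the three copies per image are hypothetical numbers chosen only for illustration, and `monthly_storage_cost` is an illustrative helper, not a vendor formula.

```python
def monthly_storage_cost(num_files: int, avg_size_mb: float, rate_per_tb: float) -> float:
    """Estimate monthly cost in dollars, using decimal units (1 TB = 1,000,000 MB)."""
    terabytes = num_files * avg_size_mb / 1_000_000
    return terabytes * rate_per_tb


# Hypothetical archive: 300,000 unique images, each stored three times at 100 MB.
UNIQUE_IMAGES = 300_000
COPIES_PER_IMAGE = 3
AVG_SIZE_MB = 100

for rate in (23, 35):  # dollars per terabyte per month, from the range cited above
    with_duplicates = monthly_storage_cost(UNIQUE_IMAGES * COPIES_PER_IMAGE, AVG_SIZE_MB, rate)
    deduplicated = monthly_storage_cost(UNIQUE_IMAGES, AVG_SIZE_MB, rate)
    print(f"${rate}/TB: ${with_duplicates:,.0f} with duplicates, ${deduplicated:,.0f} without")
```

Under these assumptions the archive occupies 90 terabytes and costs about $2,070 to $3,150 a month, against $690 to $1,050 for the 30 terabytes that would remain after deduplication.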
Information science faculty at Simmons University, which runs one of the country's oldest library and archival studies programs from its main campus on the Fenway, have been incorporating deduplication workflows into graduate coursework. The practical emphasis reflects what employers are asking for: institutions want newly hired archivists who can operate deduplication software, not just cataloging tools.
The technology itself has matured considerably. Open-source duplicate finders such as fdupes and dupeGuru, along with commercial platforms purpose-built for cultural heritage collections, can now process tens of thousands of images in an overnight batch, generating reports that curators review the following morning. The labor bottleneck is no longer the detection phase. It is the human review that follows, particularly when legal rights or donor agreements are attached to specific file versions.
For Boston institutions moving forward, archivists recommend establishing a single authoritative master file registry before any large-scale scanning project begins, a step that is easier to take at the outset than to retrofit onto an existing collection. The Boston City Archives, located on Rivermoor Street in West Roxbury, began piloting a registry-first workflow for new municipal photograph intake earlier this year, a model that digital preservation advocates say other city departments could adopt before the next round of technology contracts comes up for renewal in 2027.
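In code terms, a registry-first intake might look something like the following sketch. It is an assumption about how such a registry could work, not a description of the City Archives pilot: the `MasterRegistry` class, its `register` method, and the item identifiers are all hypothetical, and the registry is keyed on a SHA-256 content hash for simplicity.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class MasterRegistry:
    """Maps a content hash to the identifier of the authoritative master file."""

    masters: dict[str, str] = field(default_factory=dict)

    def register(self, item_id: str, content: bytes) -> tuple[bool, str]:
        """Return (accepted, master_id); content already on file is not accepted."""
        key = hashlib.sha256(content).hexdigest()
        if key in self.masters:
            # A master already exists, so point the caller at it instead.
            return False, self.masters[key]
        self.masters[key] = item_id
        return True, item_id


registry = MasterRegistry()
print(registry.register("photo-0001", b"raw scan bytes"))  # first copy becomes the master
print(registry.register("photo-0002", b"raw scan bytes"))  # same bytes, refers back to photo-0001
```

A content hash catches only byte-identical uploads. A rescan of the same photograph at a higher resolution would produce a different hash, so a working registry would likely also need to key on an identifier for the physical item.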