Boston's public institutions may collectively hold hundreds of thousands of duplicate digital image files across their archival databases, a sprawling, largely invisible problem that is costing the city and its partner organizations real money and measurable staff time, according to digital preservation specialists familiar with the issue.
The timing matters. Mayor Michelle Wu's administration has pushed a broader open-data and government modernization agenda since 2022, and several Boston departments are now mid-cycle on digital infrastructure upgrades. That means procurement decisions made in the next 12 to 18 months will shape how — or whether — duplicate image bloat gets addressed before it compounds further.
How Big Is the Problem?
Industry benchmarks from the Federal Agencies Digital Guidelines Initiative, a working group that includes the Library of Congress, suggest that large public repositories routinely find that 15 to 30 percent of ingested image assets are functional duplicates or near-duplicates — files that differ only in resolution, compression level, or file-naming convention. Apply the lower end of that range to the Boston Public Library's Digital Commonwealth platform, which as of its most recent public report held more than 1.4 million digitized items, and the arithmetic produces an uncomfortable figure: potentially 210,000 redundant image objects consuming server space and metadata labor.
Storage is not free. Enterprise-grade archival storage on the East Coast runs roughly $3,000 to $5,000 per terabyte per year when you factor in redundancy, backup, and migration costs. High-resolution image files, the kind the BPL and the City of Boston Archives on City Hall Plaza routinely ingest from scanning projects, average 50 to 150 megabytes each. A library of 210,000 such files represents roughly 10.5 to 31.5 terabytes of potentially wasted storage, or about $31,500 to $157,500 a year at those rates.
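The arithmetic behind those figures can be checked with a short sketch. It uses only the numbers quoted above and assumes decimal units (1 terabyte = 1,000,000 megabytes); the function and constant names are illustrative, not drawn from any institution's tooling.

```python
# Back-of-envelope estimate using only figures quoted in the article.
# Assumption: decimal units, so 1 TB = 1,000,000 MB.

ITEMS = 1_400_000                 # digitized items reported for Digital Commonwealth
DUPLICATE_PERCENT = 15            # lower end of the 15-30 percent benchmark range
MB_PER_FILE = (50, 150)           # average high-resolution file size range, in MB
USD_PER_TB_YEAR = (3_000, 5_000)  # archival storage cost range, per TB per year
MB_PER_TB = 1_000_000


def estimate_overhead():
    """Return duplicate count, storage range in TB, and yearly cost range in USD."""
    duplicates = ITEMS * DUPLICATE_PERCENT // 100
    tb_low = duplicates * MB_PER_FILE[0] / MB_PER_TB
    tb_high = duplicates * MB_PER_FILE[1] / MB_PER_TB
    cost_low = tb_low * USD_PER_TB_YEAR[0]
    cost_high = tb_high * USD_PER_TB_YEAR[1]
    return duplicates, (tb_low, tb_high), (cost_low, cost_high)


if __name__ == "__main__":
    duplicates, terabytes, cost = estimate_overhead()
    print(f"{duplicates:,} probable duplicates")
    print(f"{terabytes[0]:.1f} to {terabytes[1]:.1f} TB")
    print(f"${cost[0]:,.0f} to ${cost[1]:,.0f} per year")
```

Under these assumptions the estimate comes to 210,000 files, 10.5 to 31.5 terabytes, and $31,500 to $157,500 per year. The low and high ends pair the smallest file size with the cheapest storage and the largest with the most expensive, so the true figure likely falls somewhere inside the range.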
Northeastern University's Digital Scholarship Group, based on Huntington Avenue, has been working with partner institutions on exactly this kind of deduplication challenge. The group has noted that manual review of flagged duplicates is the single largest labor cost in any remediation project — a finding consistent with what archivists at peer institutions in New York and Chicago have reported publicly.
What Deduplication Actually Requires
Automated tools can scan a repository and flag probable duplicates in days rather than the months that manual review requires. DROID, the file-profiling tool developed by the UK National Archives, can generate checksums that expose byte-identical copies, while open-source perceptual hashing libraries catch near-duplicates that differ in resolution or compression. The catch is integration. Boston's City Archives and the BPL's Digital Commonwealth run on different content management systems, and neither has publicly announced a unified deduplication protocol as of July 2026.
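To show what the simplest form of automated detection looks like, here is a minimal sketch that groups files by SHA-256 checksum using only the Python standard library. It is not how DROID or any named system works internally, and the function names are hypothetical. It catches byte-identical copies, such as the same scan saved under two filenames, but not versions that differ in resolution or compression.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large scans never load fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(root):
    """Return lists of paths under root whose contents are byte-identical."""
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            groups[sha256_of(path)].append(path)
    # Only checksums shared by two or more files indicate duplicates.
    return [paths for paths in groups.values() if len(paths) > 1]
```

Calling `find_exact_duplicates("/path/to/scans")` (a placeholder path) returns one list per set of identical files. Flagged groups would still need a human decision about which copy to keep, which is where the manual-review labor cost arises.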
The Massachusetts Board of Library Commissioners, headquartered in Boston at 98 North Washington Street, administers federal Library Services and Technology Act funds that can be directed toward exactly this kind of infrastructure work. In fiscal year 2025, the MBLC distributed grants totaling several million dollars across the state for digital preservation projects, though the specific allocation for deduplication initiatives was not broken out in the agency's published grant summaries.
Jamaica Plain-based community archive efforts, including digitization drives run through the Hyde Square Task Force, face the same problem at a smaller scale: volunteer-led scanning projects frequently produce multiple versions of the same photograph with inconsistent filenames, and there is no systematic process to reconcile them against the main BPL repository before upload.
For institutions looking to get ahead of the problem, digital preservation consultants recommend a three-step approach. First, run a full perceptual hash audit across all holdings to establish a baseline duplicate rate. Second, set ingest rules that automatically flag probable duplicates before they enter the archive. Third, build a quarterly review cycle into staff workflows rather than treating deduplication as a one-time project.
The upfront cost of a hash audit for a mid-sized repository typically runs between $15,000 and $40,000 depending on collection size and system complexity, a fraction of the ongoing storage costs that unchecked duplication generates over a five-year budget horizon. Boston's institutions, with major digitization funding cycles due for renewal by late 2027, have a narrow window to act before the next round of scanning projects adds to the backlog.
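The perceptual hash audit and the ingest rule can be sketched in a few lines. This is an illustration under stated assumptions, not any institution's workflow. It uses a basic "average hash," one of several perceptual hashing techniques. It assumes each image has already been decoded and shrunk to a small grayscale grid, work an imaging library would do in a real pipeline. The `IngestGate` class and the distance threshold of 5 are hypothetical.

```python
def average_hash(pixels):
    """Build a bit fingerprint from a small grayscale grid (rows of 0-255 ints).

    Each bit records whether a pixel is brighter than the grid's mean, so
    uniform brightness shifts leave the fingerprint unchanged.
    """
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a, b):
    """Count the bit positions where two fingerprints differ."""
    return bin(a ^ b).count("1")


class IngestGate:
    """Flag probable duplicates before a new item enters the archive."""

    def __init__(self, max_distance=5):
        # Assumption: fingerprints within 5 differing bits count as probable
        # duplicates. A real threshold would be tuned on the collection.
        self.max_distance = max_distance
        self.known = {}  # item id -> fingerprint

    def check(self, item_id, pixels):
        """Return ids of probable duplicates, then register the new item."""
        new_hash = average_hash(pixels)
        matches = [
            other
            for other, known_hash in self.known.items()
            if hamming_distance(new_hash, known_hash) <= self.max_distance
        ]
        self.known[item_id] = new_hash
        return matches
```

Running `check` over every existing item yields the baseline audit; running it on each new upload implements the ingest rule. The comparison here is a linear scan over all known fingerprints, which is simple but slow for large collections, so a repository with more than a million items would likely need an indexed lookup instead.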