By the Numbers: Boston's Digital Archives Are Drowning in Duplicate Images
A closer look at the scale of the duplicate-image problem across Boston's public institutions reveals a data crisis hiding in plain sight.

Boston's public libraries, city agencies, and university digital repositories collectively hold an estimated tens of millions of image files — and a growing share of that storage is eaten up by duplicates, near-duplicates, and redundant scans that cost real money to maintain. The problem is not abstract. At the Boston Public Library's Digital Repository Service on Boylston Street, staff have flagged duplicate-image management as a priority concern in ongoing internal audits, as the institution's digital holdings have expanded rapidly since 2020.
The timing matters because storage is no longer cheap, and Boston's institutional budgets are under pressure. Cloud storage pricing from major vendors has crept upward since 2024, with enterprise-tier object storage now running from $0.02 to $0.023 per gigabyte per month, a figure that compounds fast when a single digitization project can generate hundreds of thousands of image files. For a mid-sized university archive holding 40 terabytes of image data with a conservative 15 percent duplication rate, the redundant 6 terabytes alone translate to roughly $1,400 to $1,700 in avoidable annual storage spend.
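The storage estimate above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name and the choice of 1,000 gigabytes per terabyte are assumptions, and the inputs are the 40-terabyte archive, 15 percent duplication rate, and per-gigabyte prices quoted in this article.

```python
def annual_duplicate_cost(total_tb, duplicate_rate, price_per_gb_month, gb_per_tb=1000):
    """Yearly storage cost of the duplicated share of an archive.

    gb_per_tb=1000 is an assumption (decimal terabytes); vendors differ.
    """
    duplicate_gb = total_tb * gb_per_tb * duplicate_rate
    return duplicate_gb * price_per_gb_month * 12


# Figures quoted in the article: 40 TB, 15 percent duplicates, $0.02 to $0.023 per GB-month.
low = annual_duplicate_cost(40, 0.15, 0.02)
high = annual_duplicate_cost(40, 0.15, 0.023)
print(f"${low:,.0f} to ${high:,.0f} per year")  # $1,440 to $1,656 per year
```

The same function can be rerun with an institution's own archive size and contract price, which is usually more useful than the illustrative mid-sized case.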
Northeastern University's Library on Huntington Avenue and the Harvard Library system in Cambridge have both invested in automated deduplication tooling in recent years, part of broader digital preservation strategies. The challenge is not just identical copies — it's the near-duplicate problem, where the same archival photograph is scanned at 300 DPI, then again at 600 DPI, saved in both TIFF and JPEG formats, and sometimes uploaded twice through separate workflows. Automated detection tools flag exact hash matches easily, but perceptual hashing algorithms that catch visually similar images are still maturing.
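The gap between exact and perceptual matching is easier to see in a toy example. The sketch below is a simplified illustration, not any institution's tooling: it uses a basic "average hash" over a tiny grayscale grid that stands in for a downscaled scan, and every name and pixel value in it is made up for the demonstration. Production tools typically decode and shrink the real image first and use more robust hashes.

```python
import hashlib


def exact_digest(pixels):
    """SHA-256 over the raw pixel bytes: changes if any single value changes."""
    flat = [value for row in pixels for value in row]
    return hashlib.sha256(bytes(flat)).hexdigest()


def average_hash(pixels):
    """One bit per pixel: 1 if the pixel is brighter than the grid's mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return [1 if value > mean else 0 for value in flat]


def hamming_distance(hash_a, hash_b):
    """Number of bit positions where the two hashes disagree."""
    return sum(a != b for a, b in zip(hash_a, hash_b))


# Hypothetical 4x4 grayscale grid standing in for a downscaled photograph.
original = [
    [10, 20, 200, 210],
    [15, 25, 205, 215],
    [12, 22, 198, 208],
    [11, 21, 202, 212],
]
# A second "scan" of the same picture, slightly brighter overall.
rescan = [[value + 3 for value in row] for row in original]
# A visually different picture: the same values mirrored left to right.
mirrored = [row[::-1] for row in original]

print(exact_digest(original) == exact_digest(rescan))                    # False
print(hamming_distance(average_hash(original), average_hash(rescan)))    # 0
print(hamming_distance(average_hash(original), average_hash(mirrored)))  # 16
```

The exact digest treats the brighter rescan as an unrelated file, while the average hash places it at distance zero from the original. Choosing how large a distance still counts as "the same image" is the judgment call that makes near-duplicate detection harder than exact matching.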
Across the country, the Digital Preservation Coalition has documented that storage waste from unmanaged duplication commonly runs between 10 and 30 percent of total image repository size. Apply even the low end of that range to Boston's institutional context, where the BPL alone holds more than 1.7 million digitized items as of its most recent public reporting, and the numbers add up quickly. Each redundant high-resolution TIFF file can run 50 to 100 megabytes, so if 10 percent of those items were duplicated as TIFFs, the excess would come to roughly 8.5 to 17 terabytes. At scale, that is not a rounding error; it is a budget line.
The city's biotech and university economy has accelerated digitization demand. The imaging archives of Mass General Brigham (formerly Partners HealthCare) and the Massachusetts Institute of Technology's collections in Cambridge each face internal pressure to consolidate sprawling file systems built up over two decades of incremental digitization grants. Federal Library Services and Technology Act funding, administered in Massachusetts through the Massachusetts Board of Library Commissioners, has historically supported digitization but has not consistently required deduplication protocols as a grant condition.
Some local repositories are moving toward open-source tools like DROID — Digital Record Object Identification — developed by The National Archives in the UK, which can batch-process file inventories and surface duplicates before they migrate to cloud storage. The Massachusetts Digital Commonwealth program, which aggregates digital collections from institutions across the state including the Bostonian Society on Washington Street, has begun piloting stricter ingest validation rules to catch duplicate uploads before they enter the shared repository.
The practical advice for smaller institutions — neighborhood historical societies in Jamaica Plain, archival collections at UMass Boston on Morrissey Boulevard — is to run deduplication audits before any major storage migration, not after. Cleaning up a 10-terabyte archive before moving it to the cloud is a matter of staff hours. Cleaning it up afterward, once files are distributed across multiple storage tiers, is a project measured in months.
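For a small institution, a pre-migration audit of exact duplicates does not require specialized software. The sketch below uses only Python's standard library and is an assumption-laden starting point rather than a replacement for a preservation tool: the function names are hypothetical, it catches only byte-identical files, and it does nothing about near-duplicates such as the same photograph saved as both TIFF and JPEG.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in 1 MiB chunks so large TIFFs never load whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(root):
    """Map each SHA-256 digest to the files under root that share it (2 or more)."""
    # Files of different sizes cannot be identical, so group by size first
    # and hash only the files that have at least one same-sized sibling.
    by_size = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            by_size[path.stat().st_size].append(path)

    by_digest = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        for path in paths:
            by_digest[file_digest(path)].append(path)

    return {digest: sorted(paths) for digest, paths in by_digest.items() if len(paths) > 1}


if __name__ == "__main__":
    # "." is a placeholder; point this at the archive directory to be audited.
    for digest, paths in find_exact_duplicates(".").items():
        print(digest[:12], *paths, sep="\n  ")
```

The script only reports groups of identical files. Deciding which copy is the preservation master, and which can be removed, remains a curatorial decision.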
The July 4th holiday weekend, when skeleton crews staff most Boston institutions, is actually a common moment for automated audit scripts to run uninterrupted. By Monday morning, a number of local archivists will have fresh duplication reports sitting in their inboxes — and the hard work of deciding what to delete, and what to keep, begins all over again.
About this article
Published by The Daily Boston


