Boston Institutions Race to Purge Duplicate Images From Digital Archives This Week
Libraries, hospitals, and city agencies across Boston are accelerating duplicate-image cleanup projects as storage costs spike and AI-cataloguing tools mature.

Duplicate images are quietly eating Boston's digital budgets. This week, at least three major institutions — the Boston Public Library's Digital Repository, Massachusetts General Hospital's radiology records division, and the city's Office of Digital Innovation — confirmed they have active projects underway to identify and remove redundant image files clogging their servers, a problem that has compounded steadily since the pandemic-era rush to digitize physical records.
The timing is not accidental. Cloud storage pricing from major vendors rose again in the first quarter of 2026, with enterprise tiers for organizations storing more than 100 terabytes increasing by an average of 12 percent year-over-year, according to pricing disclosures from multiple providers. For institutions like BPL, which digitized roughly 1.4 million items through its Digitization Services program on Dartmouth Street, redundant copies of the same photograph or document scan can represent a measurable fraction of annual IT spend.
The Boston Public Library has gone furthest in public. Staff on the BPL's Digital Repository team, based at the Central Library in Copley Square, confirmed this week that a six-month deduplication audit wrapped its first phase on June 30, targeting the Norman B. Leventhal Map & Education Center's image collection. The audit used perceptual hashing software to flag near-identical scans, a method that catches not only exact duplicates but also images rescanned at slightly different resolutions. The library has not yet released figures on how many files were flagged, but the project is one of the more systematic efforts any Boston cultural institution has undertaken on this front.
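For readers curious how perceptual hashing can match the same scan at two resolutions, here is a minimal sketch of one common variant, the average hash. It is an illustration only, not the BPL's software: the function names, the 8x8 grid, and the distance threshold of 5 are all assumptions chosen for the example, and images are represented as plain grids of grayscale values rather than real files.

```python
def average_hash(pixels, size=8):
    """Return a size*size-bit fingerprint of a grayscale image.

    `pixels` is a 2D list of brightness values (rows of equal length).
    Hypothetical helper for illustration; real tools add more steps.
    """
    height, width = len(pixels), len(pixels[0])
    cells = []
    # Shrink the image to a size x size grid by averaging each block,
    # which discards fine detail and differences in resolution.
    for r in range(size):
        r0 = r * height // size
        r1 = max((r + 1) * height // size, r0 + 1)
        for c in range(size):
            c0 = c * width // size
            c1 = max((c + 1) * width // size, c0 + 1)
            block = [pixels[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    # One bit per cell: 1 if the cell is brighter than the overall mean.
    bits = 0
    for value in cells:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(pixels_a, pixels_b, threshold=5):
    """Flag two images as near-duplicates (threshold is an assumed value)."""
    distance = hamming_distance(average_hash(pixels_a), average_hash(pixels_b))
    return distance <= threshold
```

Two scans of the same map at different resolutions shrink to nearly the same grid, so their fingerprints differ in few or no bits, while unrelated images differ in many. In practice the threshold is tuned per collection, and flagged pairs would still need human review before anything is deleted.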
Meanwhile, at MGH, the radiology information technology team has been running a parallel initiative since April under a broader electronic health records consolidation tied to Mass General Brigham's system-wide Epic upgrade. Duplicate DICOM image files — the standard format for medical scans — have been a persistent headache in hospital IT for years. Industry estimates from KLAS Research, a healthcare IT analytics firm, suggest that duplicate imaging records affect between 8 and 22 percent of patient files at large academic medical centers. MGH has not released its own figures publicly, but the EHR consolidation project was confirmed in Mass General Brigham's 2025 annual report as a priority initiative expected to run through the end of fiscal year 2027.
The city's Office of Digital Innovation, headquartered at City Hall on Congress Street, is dealing with a different flavor of the same problem: duplicate photographs and renderings stored across multiple departments as part of the Wu administration's open-data initiative. Planning documents filed with the Boston Planning Department for the Newmarket and Dorchester Avenue corridor projects generated thousands of image assets that, according to a city procurement notice posted in May, were stored redundantly across at least four separate departmental SharePoint environments. A vendor contract for deduplication tools was posted to the city's procurement portal, BAIS, on May 19, with a value of $87,400.
The practical stakes extend well beyond IT housekeeping. At the BPL, duplicate images can surface incorrectly in public search results, returning the same map or photograph multiple times and degrading the research experience for students at nearby institutions like Northeastern University on Huntington Avenue or Emerson College on Boylston Street. In the medical context, duplicate imaging files carry patient safety implications — a radiologist pulling up a patient record and seeing two versions of the same scan from different dates can face unnecessary diagnostic confusion.
The convergence of AI-powered cataloguing tools and rising storage costs has created what IT administrators describe as an inflection point. Tools that use machine learning to identify visually similar images — not just identical file checksums — have become significantly more accurate and affordable in the past 18 months, making the cleanup projects that institutions kept deferring suddenly tractable.
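The older checksum approach mentioned above can be sketched in a few lines, which also shows its limit. This is an assumed, simplified example: the function name and the in-memory dictionary of file contents are hypothetical stand-ins for a real archive.

```python
import hashlib
from collections import defaultdict


def group_exact_duplicates(files):
    """Group file names whose contents are byte-for-byte identical.

    `files` maps a name to its raw bytes (a hypothetical stand-in for
    reading files from storage).
    """
    groups = defaultdict(list)
    for name, data in files.items():
        # Identical bytes always produce the same SHA-256 digest.
        digest = hashlib.sha256(data).hexdigest()
        groups[digest].append(name)
    # Keep only digests shared by more than one file.
    return [sorted(names) for names in groups.values() if len(names) > 1]
```

Because a single changed byte produces a completely different digest, a rescan or a re-saved copy of the same photograph slips past this check, which is the gap that similarity-based tools are meant to close.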
For Boston residents and researchers, the practical upshot is worth watching. The BPL's Digital Repository team expects to publish updated collection counts through its online portal by September 2026, following the first deduplication phase. The city's vendor contract runs through January 2027. Anyone who relies on Boston's public digital archives, from genealogists using the library's collections to journalists pulling planning documents, should expect cleaner, faster search results on the other side of these projects, even if the work itself remains largely invisible while it happens.
Published by The Daily Boston


