Boston's Digital Archives Are Riddled With Duplicate Images — and the Numbers Are Staggering
Redundant photo files are costing city agencies, universities, and cultural institutions real money and real storage space, and a reckoning is growing.

Boston's public institutions are sitting on a data problem hiding in plain sight. Duplicate images — identical or near-identical photo files stored multiple times across shared servers and cloud platforms — account for an estimated 20 to 30 percent of total digital storage consumption at mid-size organizations nationwide, according to data management research published by the Storage Networking Industry Association. For a city with as many universities, hospitals, and municipal agencies as Boston, that translates into millions of dollars in avoidable infrastructure spending every year.
The timing matters. The City of Boston's fiscal year 2026 IT budget, passed by the City Council earlier this spring, allocated roughly $47 million to technology operations across municipal departments. As agencies digitize more records — building permits in Roxbury, zoning filings in South Boston, public health data from Mattapan — the volume of stored image files is climbing fast. Without systematic deduplication, that cost compounds.
Two Boston institutions illustrate the scale clearly. The Boston Public Library, headquartered on Boylston Street in Copley Square, has been digitizing its print and photographic collections for over a decade through its Digital Commonwealth program, a statewide repository serving more than 180 member institutions. Archivists there have acknowledged publicly that managing file redundancy across that many contributing organizations is one of the persistent technical challenges of shared repository work — the same image scanned at different resolutions by different institutions ends up stored separately, multiplying storage load without adding informational value.
Northeastern University's library system on Huntington Avenue faces a similar challenge. Its digital preservation infrastructure supports research collections across multiple colleges, and graduate programs in information science there have used institutional deduplication efforts as live case study material. When a single 30-megabyte RAW image file ends up as ten copies across departmental drives, it occupies 300 megabytes where 30 would do, and 90 percent of that space is pure redundancy.
The MBTA, which has been under intense scrutiny over its technology spending since the Federal Transit Administration lifted its safety oversight agreement in 2024, maintains thousands of security and operational camera feeds. Footage archiving, when not carefully managed, is one of the fastest generators of duplicate file bloat in any transit system of comparable size.
Cloud storage pricing gives the clearest sense of what duplication actually costs. Amazon Web Services standard S3 storage runs approximately $0.023 per gigabyte per month as of mid-2026. A municipal department storing 10 terabytes of image files (about 10,000 gigabytes, not an unusual figure for a mid-size city agency with active permitting or surveillance functions) pays around $230 a month just for that tier. If 25 percent of those files are duplicates, the agency is burning roughly $57.50 a month on redundant data. That number adds up: across a dozen departments of the same size, it becomes about $690 a month, or nearly $8,300 a year, purely on preventable redundancy.
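The arithmetic is easy to check. The short Python sketch below reproduces it under the article's own figures; the function name and the choice of 1,000 gigabytes per terabyte are illustrative assumptions, and real bills also include request and transfer charges that are left out here.

```python
# Rough cost of duplicate image files in cloud storage.
# Assumptions: flat per-GB rate, 1 TB = 1,000 GB, storage charges only.

S3_STANDARD_USD_PER_GB_MONTH = 0.023  # rate quoted in the article


def monthly_duplicate_cost(stored_tb, duplicate_share,
                           usd_per_gb_month=S3_STANDARD_USD_PER_GB_MONTH,
                           gb_per_tb=1000):
    """Return the monthly spend attributable to duplicate files."""
    stored_gb = stored_tb * gb_per_tb
    return stored_gb * duplicate_share * usd_per_gb_month


per_department = monthly_duplicate_cost(stored_tb=10, duplicate_share=0.25)
dozen_departments_monthly = per_department * 12   # twelve departments
dozen_departments_yearly = dozen_departments_monthly * 12  # twelve months

print(f"One department:     ${per_department:,.2f} per month")
print(f"Twelve departments: ${dozen_departments_monthly:,.2f} per month")
print(f"Twelve departments: ${dozen_departments_yearly:,.2f} per year")
```

Changing `duplicate_share` to 0.20 or 0.30 shows how the estimate moves across the 20 to 30 percent range cited earlier.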
The broader institutional picture is worse. A 2023 report by Veritas Technologies estimated that unstructured data — which includes image files — grows at a compound annual rate of about 23 percent in enterprise and public-sector environments. Boston's biotech corridor along Longwood Avenue, home to Brigham and Women's Hospital, Beth Israel Deaconess Medical Center, and dozens of research firms, generates enormous volumes of medical imaging data. Even modest deduplication improvements in that ecosystem can free up petabytes of storage over a five-year horizon.
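To put the growth rate in perspective, a minimal Python sketch of the compounding follows. It assumes the 23 percent rate holds steady for five years, which is a simplification, and the function name is hypothetical.

```python
# Compound growth of stored data at a constant annual rate.
# Assumption: the 23 percent rate cited in the article stays fixed.


def growth_multiple(annual_rate, years):
    """Return how many times larger a data store is after `years`."""
    return (1 + annual_rate) ** years


five_year_multiple = growth_multiple(annual_rate=0.23, years=5)
print(f"After five years the store is {five_year_multiple:.2f} times larger")
```

At that pace a store nearly triples in five years, so any fixed share of duplicates nearly triples in absolute size along with it.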
Mayor Michelle Wu's administration has pushed a broader open-data and digital efficiency agenda since taking office in November 2021, but city procurement records don't yet show a dedicated contract for image deduplication tooling across municipal departments. Several peer cities — including Chicago and Denver — have moved to enterprise digital asset management platforms that include automated deduplication as a core feature.
For Boston institutions looking to get ahead of the problem, archivists and IT procurement officers should start by auditing shared network drives and cloud buckets for hash-matched duplicates — a process that tools like DupeGuru and commercial DAM platforms can automate in days. The harder lift is governance: agreeing on a single file-naming convention and upload protocol before a digitization project begins, not after. At the Boston Public Library's Digital Commonwealth scale, that means working through 180 member institutions. That's a policy problem as much as a technology one, and no software patch fixes it alone.
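For readers curious what a hash-matched audit involves, the Python sketch below shows one common approach using only the standard library. It is an illustration, not a description of how any named tool works internally: files are first grouped by size, then files of equal size are compared by SHA-256 digest. The function names are hypothetical.

```python
# Find byte-identical files under a directory by size, then SHA-256.
# Illustrative sketch only; it does not detect near-duplicates such as
# the same photo scanned at different resolutions.
import hashlib
from collections import defaultdict
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large images are not loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(root):
    """Return a list of groups; each group is a list of identical files."""
    by_size = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            by_size[path.stat().st_size].append(path)

    by_hash = defaultdict(list)
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size cannot have an identical twin
        for path in same_size:
            by_hash[file_sha256(path)].append(path)

    return [sorted(group) for group in by_hash.values() if len(group) > 1]
```

Calling `find_exact_duplicates` on a shared drive's mount point returns groups of paths whose contents match byte for byte. A scan like this catches only exact copies; the rescans at different resolutions described above would need image-similarity techniques instead, which is one reason upload governance matters as much as tooling.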
Published by The Daily Boston