The Daily Boston

Boston's Digital Archives Are Riddled With Duplicate Images — and the Numbers Are Staggering

A growing reckoning with redundant photo files is costing city agencies, universities, and cultural institutions real money and real storage space.

By Boston News Desk · Published 4 July 2026, 3:51 pm

Photo: Mohammed Abubakr on Pexels

Boston's public institutions are sitting on a data problem hiding in plain sight. Duplicate images — identical or near-identical photo files stored multiple times across shared servers and cloud platforms — account for an estimated 20 to 30 percent of total digital storage consumption at mid-size organizations nationwide, according to data management research published by the Storage Networking Industry Association. For a city with as many universities, hospitals, and municipal agencies as Boston, that translates into millions of dollars in avoidable infrastructure spending every year.

The timing matters. The City of Boston's fiscal year 2026 IT budget, passed by the City Council earlier this spring, allocated roughly $47 million to technology operations across municipal departments. As agencies digitize more records — building permits in Roxbury, zoning filings in South Boston, public health data from Mattapan — the volume of stored image files is climbing fast. Without systematic deduplication, that cost compounds.

Where the Problem Shows Up Locally

Two Boston institutions illustrate the scale clearly. The Boston Public Library, headquartered on Boylston Street in Copley Square, has been digitizing its print and photographic collections for over a decade through its Digital Commonwealth program, a statewide repository serving more than 180 member institutions. Archivists there have acknowledged publicly that managing file redundancy across that many contributing organizations is one of the persistent technical challenges of shared repository work — the same image scanned at different resolutions by different institutions ends up stored separately, multiplying storage load without adding informational value.

Northeastern University's library system on Huntington Avenue faces a similar challenge. Its digital preservation infrastructure supports research collections across multiple colleges, and graduate programs in information science there have used institutional deduplication efforts as live case study material. When a single 30-megabyte RAW image file gets duplicated even ten times across departmental drives, the arithmetic turns ugly fast.

The MBTA offers a third example, this one involving video. The agency, which has been under intense scrutiny over its technology spending since the Federal Transit Administration lifted its safety oversight agreement in 2024, maintains thousands of security and operational camera feeds. Archived footage, when not carefully managed, is one of the fastest sources of duplicate-file bloat in any large transit system.

The Data Behind the Dollar Signs

Cloud storage pricing gives the clearest sense of what duplication actually costs. Amazon Web Services standard S3 storage runs approximately $0.023 per gigabyte per month as of mid-2026. A municipal department storing 10 terabytes of image files, not an unusual figure for a mid-size city agency with active permitting or surveillance functions, pays around $230 a month just for that tier. If 25 percent of those files are duplicates, the agency is spending about $57.50 a month on redundant data. Across a dozen departments of that size, the figure comes to roughly $690 a month, or about $8,280 a year, purely on preventable redundancy.
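For readers who want to check the math or plug in their own figures, the calculation can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, the price is the per-gigabyte figure quoted above, and a terabyte is assumed to be 1,000 gigabytes, which is how the $230 figure works out.

```python
# Assumed inputs, taken from the figures quoted in the article.
PRICE_PER_GB_MONTH = 0.023  # S3 Standard price per gigabyte per month, as cited
GB_PER_TB = 1000            # assumption: decimal terabytes


def monthly_storage_cost(terabytes, price_per_gb=PRICE_PER_GB_MONTH):
    """Monthly bill, in dollars, for storing the given number of terabytes."""
    return terabytes * GB_PER_TB * price_per_gb


def duplicate_waste(terabytes, duplicate_share, departments=1):
    """Return (monthly, yearly) dollars spent on duplicate files.

    duplicate_share is the fraction of stored data that is redundant,
    for example 0.25 for 25 percent.
    """
    monthly = monthly_storage_cost(terabytes) * duplicate_share * departments
    return monthly, monthly * 12


if __name__ == "__main__":
    print(f"{monthly_storage_cost(10):.2f}")          # 230.00
    one_dept_month, _ = duplicate_waste(10, 0.25)
    print(f"{one_dept_month:.2f}")                    # 57.50
    month, year = duplicate_waste(10, 0.25, departments=12)
    print(f"{month:.2f} {year:.2f}")                  # 690.00 8280.00
```

Changing the duplicate share to 0.20 or 0.30 shows how the estimate moves across the 20 to 30 percent range cited earlier.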

The broader institutional picture is worse. A 2023 report by Veritas Technologies estimated that unstructured data — which includes image files — grows at a compound annual rate of about 23 percent in enterprise and public-sector environments. Boston's biotech corridor along Longwood Avenue, home to Brigham and Women's Hospital, Beth Israel Deaconess Medical Center, and dozens of research firms, generates enormous volumes of medical imaging data. Even modest deduplication improvements in that ecosystem can free up petabytes of storage over a five-year horizon.
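To see what a 23 percent compound annual growth rate means in practice, here is a rough sketch. The 10-terabyte starting point is borrowed from the department example above purely for illustration; it is not a figure from the Veritas report, and the function name is hypothetical.

```python
def projected_volume(current_tb, annual_growth, years):
    """Project stored data volume under compound annual growth.

    annual_growth is a fraction, for example 0.23 for 23 percent.
    """
    return current_tb * (1 + annual_growth) ** years


if __name__ == "__main__":
    # Illustrative only: a 10 TB archive growing 23 percent a year for five years.
    print(f"{projected_volume(10, 0.23, 5):.1f} TB")  # 28.2 TB
```

At that rate, an archive grows to nearly three times its size in five years, and any fixed share of duplicates grows along with it.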

Mayor Michelle Wu's administration has pushed a broader open-data and digital efficiency agenda since taking office in November 2021, but city procurement records don't yet show a dedicated contract for image deduplication tooling across municipal departments. Several peer cities — including Chicago and Denver — have moved to enterprise digital asset management platforms that include automated deduplication as a core feature.

For Boston institutions looking to get ahead of the problem, archivists and IT procurement officers should start by auditing shared network drives and cloud buckets for hash-matched duplicates, a process that tools like dupeGuru and commercial DAM platforms can automate in days. The harder lift is governance: agreeing on a single file-naming convention and upload protocol before a digitization project begins, not after. At the Boston Public Library's Digital Commonwealth scale, that means working through more than 180 member institutions. That's a policy problem as much as a technology one, and no software patch fixes it alone.
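A hash-matched audit of the kind described above can be sketched with Python's standard library alone. This is a minimal illustration, not a description of how dupeGuru or any commercial platform works internally; the function names and the list of file extensions are assumptions made for the example.

```python
import hashlib
import os
from collections import defaultdict

# Assumed set of image extensions; adjust for the collection being audited.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".gif", ".bmp"}


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in 1 MB chunks so large images are never loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_images(root):
    """Walk a directory tree and group image files with identical contents.

    Returns a dict mapping each SHA-256 digest to a sorted list of paths,
    keeping only digests shared by two or more files.
    """
    by_hash = defaultdict(list)
    for folder, _subdirs, filenames in os.walk(root):
        for name in filenames:
            if os.path.splitext(name)[1].lower() in IMAGE_EXTENSIONS:
                path = os.path.join(folder, name)
                by_hash[file_sha256(path)].append(path)
    return {h: sorted(paths) for h, paths in by_hash.items() if len(paths) > 1}
```

Calling `find_duplicate_images` on a shared drive's mount point returns the groups of byte-identical files. Exact hashing of this kind will not flag the same photograph scanned at different resolutions, which is why near-duplicate detection and upload governance still matter.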

About this article

Published by The Daily Boston

This article was produced by The Daily Boston editorial desk and covers news in Boston. See our editorial standards for how we use AI.
