The Daily Boston

Boston's Digital Archives Are Drowning in Duplicate Images — and the Numbers Tell the Story

From the Boston Public Library to city hall's permitting database, redundant image files are costing storage dollars and slowing down public access to records.

By Boston News Desk · Published 4 July 2026, 3:23 pm

Photo: Phil Evenden on Pexels

Boston's public institutions are sitting on a data problem they can measure in terabytes and tax dollars. A growing audit trail across city agencies and major cultural repositories shows that duplicate image files — the same photograph, scan, or graphic stored two, three, sometimes a dozen times across different servers — account for a disproportionate share of ballooning digital storage costs. The problem is unglamorous, but the figures behind it are not small.

The timing matters because the city is midway through a broader digital infrastructure push tied to Mayor Michelle Wu's open-data commitments, and several major institutions are simultaneously digitizing large backlogs of physical records. When volumes spike, redundancy spikes with them. Storage that seems cheap at about $0.02 per gigabyte per month in a cloud environment adds up quickly when you're managing collections that run into the hundreds of thousands of individual image files.

Where the Redundancy Lives

The Boston Public Library's Copley Square main branch is one focal point. The BPL's Digital Commonwealth program, a statewide portal hosted out of Boston that serves libraries, historical societies, and archives across Massachusetts, crossed 1.6 million digitized objects as of its most recent public count. Administrators there have publicly acknowledged the challenge of deduplication as different contributing institutions upload overlapping scans of the same historical photographs and maps. A single 1890s photograph of Washington Street, for instance, might exist in the portal under three different contributing-institution entries, each carrying its own full-resolution TIFF file at roughly 50 megabytes apiece.

Across the Charles River at MIT's libraries, and back in Boston at the Northeastern University Archives in the Snell Library on Huntington Avenue, archivists face a version of the same issue inside their internal digital asset management systems. Those working with large photographic donations routinely find that donors themselves have already duplicated files across personal drives before submission, seeding redundancy before the material even enters a formal repository.

At City Hall on Cambridge Street, the problem surfaces in a more mundane context: the Inspectional Services Department's permitting and code-enforcement image database. Property inspection photos, uploaded by field officers from mobile devices since the department moved to a mobile-first workflow in 2022, frequently generate automatic duplicates when network connections drop mid-upload and the app retries. Staff who have discussed the workflow at public forums have described the retry mechanism as the single largest source of redundant files in the system, though the city has not released a precise count of affected records.

The Cost Calculus

Quantifying the dollar impact requires some arithmetic. Enterprise cloud storage from the providers municipal governments typically use runs between $0.018 and $0.023 per gigabyte per month under standard tiered contracts. If a mid-sized city archive totals 20 terabytes and carries 30 percent of its image library as redundant copies, a figure consistent with deduplication studies published by the Digital Preservation Coalition in 2024, then about 6 terabytes, or roughly 6,000 gigabytes, is duplicate data, and the unnecessary monthly spend lands at roughly $110 to $140. That is modest at that scale, but the math changes when collections reach the petabyte range that statewide systems like Digital Commonwealth are approaching.
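For readers who want to check the arithmetic or plug in their own numbers, here is a minimal Python sketch of the calculation described above. The function name and the `gb_per_tb` parameter are illustrative assumptions, not anything a city agency uses; the inputs are the article's own figures.

```python
def redundant_storage_cost(total_tb, redundant_share, price_per_gb_month, gb_per_tb=1000):
    """Monthly cost of storing redundant copies.

    gb_per_tb is an assumption: cloud bills usually count 1,000 GB per TB,
    while some tools report 1,024.
    """
    redundant_gb = total_tb * gb_per_tb * redundant_share
    return redundant_gb * price_per_gb_month


# The article's scenario: a 20 TB library with 30 percent redundant copies.
low = redundant_storage_cost(20, 0.30, 0.018)
high = redundant_storage_cost(20, 0.30, 0.023)
print(f"20 TB archive: about ${low:.0f} to ${high:.0f} per month")

# The same share at a hypothetical 1 PB (1,000 TB) collection.
pb_low = redundant_storage_cost(1000, 0.30, 0.018)
pb_high = redundant_storage_cost(1000, 0.30, 0.023)
print(f"1 PB archive: about ${pb_low:,.0f} to ${pb_high:,.0f} per month")
```

With decimal terabytes the 20-terabyte case works out to about $108 to $138 a month; counting 1,024 gigabytes per terabyte instead gives about $111 to $141, which is why the range above is quoted as approximate.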

Beyond raw storage, the secondary cost is search latency and curatorial labor. Duplicate records generate duplicate metadata entries, which means cataloguers at institutions like the Massachusetts Historical Society on Boylston Street spend time adjudicating which version of a record is canonical. That labor, billed at professional archivist rates of roughly $28 to $45 an hour depending on seniority, adds up across an annual collection cycle.

The deduplication software market has matured considerably. Tools built specifically for library and archive environments — products like Rosetta from Ex Libris and open-source alternatives in the Samvera ecosystem — can now run automated hash-matching against existing collections before ingesting new material, flagging probable duplicates for human review rather than silently storing them. Several Massachusetts institutions piloted hash-based deduplication workflows in 2025 as part of a LYRASIS-supported regional initiative.
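To make the idea of hash-matching before ingest concrete, here is a rough Python sketch using only the standard library. It is not how Rosetta or any Samvera tool works internally; the function names and the in-memory `known_hashes` dictionary are hypothetical stand-ins for whatever index a real repository would keep.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Compute a SHA-256 checksum, reading in chunks so large scans fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_before_ingest(candidate, known_hashes):
    """Return the path of an existing byte-identical file, or None if the file is new.

    known_hashes is a hypothetical index mapping checksum -> first path seen.
    New files are registered; probable duplicates are returned for human review.
    """
    checksum = sha256_of(candidate)
    existing = known_hashes.get(checksum)
    if existing is None:
        known_hashes[checksum] = str(candidate)
    return existing
```

One caveat: a checksum match only catches byte-for-byte identical files, such as a retried upload or a copied donor file. Two institutions' separate scans of the same photograph produce different bytes and would not be flagged this way.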

Institutions sitting on this problem should treat the July fiscal quarter-end as a practical deadline for running a baseline storage audit. Identifying the share of image storage consumed by files with identical checksums costs very little and produces the kind of concrete number that makes budget conversations with administrators straightforward. In Boston's case, those conversations are already happening — the data just needs to catch up with the urgency.
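A baseline audit of this kind can be sketched in a few dozen lines of Python. The version below is an illustration under stated assumptions, not a production tool: the function names, the extension list, and the choice to count every copy after the first as redundant are all assumptions made for the example.

```python
import hashlib
import os
from collections import defaultdict


def file_sha256(path, chunk_size=1 << 20):
    """Checksum a file in chunks so large images do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit_duplicates(root, extensions=(".tif", ".tiff", ".jpg", ".jpeg", ".png")):
    """Walk a directory tree and report the share of image bytes held by identical copies."""
    sizes_by_hash = defaultdict(list)
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(extensions):
                continue  # skip non-image files (extension list is an assumption)
            path = os.path.join(dirpath, name)
            size = os.path.getsize(path)
            total_bytes += size
            sizes_by_hash[file_sha256(path)].append(size)
    # Keep one copy per checksum; every additional copy counts as redundant.
    redundant_bytes = sum(sum(sizes[1:]) for sizes in sizes_by_hash.values())
    share = redundant_bytes / total_bytes if total_bytes else 0.0
    return {
        "total_bytes": total_bytes,
        "redundant_bytes": redundant_bytes,
        "redundant_share": share,
    }
```

Calling `audit_duplicates("/path/to/images")` on a storage mount (the path here is a placeholder) returns the total image bytes, the bytes held by extra identical copies, and the ratio between them, which is the single figure a budget conversation needs.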

About this article

Published by The Daily Boston

This article was produced by The Daily Boston editorial desk and covers news in Boston. See our editorial standards for how we use AI.
