The Daily Boston

Boston's Digital Archives Are Drowning in Duplicate Images — and the Numbers Are Staggering

A quiet data crisis inside the city's public institutions reveals how redundant image files are eating storage budgets and slowing access to historical records.

By Boston News Desk · Published 4 July 2026, 2:43 pm

Photo by Hande Naz Kavas on Pexels

Boston's public libraries, municipal agencies, and university digital archives collectively hold tens of thousands of duplicate image files — redundant scans, re-uploaded photographs, and copied assets that consume server space, inflate IT costs, and make it harder for researchers to find what they're actually looking for. The problem is measurable, and the numbers are not small.

The issue has come into sharper focus this year as institutions across the city undertake digitization compliance reviews tied to Massachusetts's Public Records Law, which governs how state and municipal bodies keep their records and make them accessible. For Boston's sprawling network of public-facing repositories, from the Digital Commonwealth platform hosted by the Boston Public Library on Boylston Street to the City of Boston Archives in West Roxbury, annual storage bills have climbed steadily even as hard drive prices fall, a paradox that points directly at unchecked duplication.

What the Data Actually Shows

Industry benchmarks from library and information science research suggest that large institutional image repositories typically contain duplicate rates between 15 and 30 percent of total file counts when no automated deduplication protocol is in place. For a collection running 500,000 image files — a realistic figure for a major urban public library system — that translates to somewhere between 75,000 and 150,000 redundant files sitting on servers, each one accruing storage costs and cluttering search results.

Cloud storage for institutional image archives typically runs between $0.02 and $0.05 per gigabyte per month depending on the provider and redundancy tier. A single high-resolution archival scan of a historical document can run 50 to 150 megabytes. Do the arithmetic on 100,000 duplicate files at an average of 80 megabytes each and you're looking at roughly 8 terabytes of redundant data, costing institutions anywhere from $160 to $400 per month, or roughly $1,920 to $4,800 annually, just to store files nobody intentionally wants twice.
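The arithmetic above can be checked in a few lines of Python. This is a back-of-envelope sketch using only the figures quoted in this article (100,000 duplicates, an 80-megabyte average, $0.02 to $0.05 per gigabyte per month). It assumes decimal units (1 terabyte = 1,000 gigabytes), and the function name is a hypothetical chosen for illustration.

```python
# Back-of-envelope storage cost for duplicate image files.
# Inputs are the article's illustrative figures, not measured data.

MB_PER_GB = 1_000  # assumption: decimal units
GB_PER_TB = 1_000


def duplicate_storage_cost(file_count, avg_size_mb, price_per_gb_month):
    """Return (terabytes, monthly_cost, annual_cost) for a set of duplicates."""
    total_gb = file_count * avg_size_mb / MB_PER_GB
    monthly = total_gb * price_per_gb_month
    return total_gb / GB_PER_TB, monthly, monthly * 12


for price in (0.02, 0.05):
    tb, monthly, annual = duplicate_storage_cost(100_000, 80, price)
    print(f"${price:.2f}/GB: {tb:.0f} TB, ${monthly:,.0f}/month, ${annual:,.0f}/year")
```

At the low price tier this works out to $160 a month and $1,920 a year; at the high tier, $400 a month and $4,800 a year.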

Northeastern University's library system on Huntington Avenue and the independent Boston Athenaeum on Beacon Street both participate in regional digitization consortia that have flagged deduplication as an unresolved operational gap in consortium-wide planning documents circulated to member institutions in 2025. Neither institution has publicly disclosed its specific duplicate rates.

Why Boston's Institutions Face a Steeper Climb

Boston's density of universities and cultural institutions — there are more than 35 colleges and universities within the city and its immediate suburbs — means digitization projects have proliferated rapidly, often with minimal coordination between neighboring archives. The Digital Commonwealth consortium, administered through the Boston Public Library and serving more than 180 contributing organizations statewide, ingests thousands of new image records monthly. Without mandatory hash-based deduplication checks at the point of upload, identical or near-identical files enter the system from multiple contributing institutions independently digitizing the same historical collections.
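A hash-based check at the point of upload could look something like the sketch below. It is illustrative only and does not describe how Digital Commonwealth or any other named system works: the class name, the record IDs, and the choice of SHA-256 with an in-memory dictionary are all assumptions, and a production system would presumably keep the index in a database.

```python
import hashlib


class IngestIndex:
    """Hypothetical index of content hashes for files already ingested."""

    def __init__(self):
        self._seen = {}  # SHA-256 hex digest -> record ID of the first upload

    def check_and_register(self, record_id, data):
        """Return the ID of an existing identical file, or None if data is new.

        New files are registered so that later uploads can be compared.
        """
        digest = hashlib.sha256(data).hexdigest()
        existing = self._seen.get(digest)
        if existing is not None:
            return existing
        self._seen[digest] = record_id
        return None


index = IngestIndex()
scan = b"bytes of a scanned photograph"  # stand-in for real file contents
print(index.check_and_register("library-a/0001", scan))  # None: first copy
print(index.check_and_register("museum-b/0417", scan))   # library-a/0001
```

A check like this catches only byte-for-byte copies. Two institutions that scan the same photograph independently produce different bytes, so those near-duplicates need a similarity-based method instead.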

Jamaica Plain's Center for Collaborative Education and the Roxbury Branch Library (formerly the Dudley Branch), both of which have received city funding under the Wu administration's neighborhood digital equity initiatives, face a version of this problem at a smaller scale. Community digitization projects in which volunteers do the scanning frequently produce multiple versions of the same photograph at different resolutions, all uploaded without a formal review workflow.

The practical consequences go beyond server bills. Duplicate images degrade the quality of keyword search results, force archivists to spend manual review hours identifying redundant records, and in some cases create confusion about which version of an image is the authoritative scan — particularly relevant when those images are historical photographs used in legal proceedings or academic citations.

The fix is neither cheap nor simple. Automated perceptual hashing tools (software that computes a compact fingerprint for each image, designed so that visually similar images yield similar fingerprints, and then flags near-matches) are available commercially and as open-source packages, but deploying them across legacy systems requires IT staff time and, in some cases, database restructuring. Institutions that have completed deduplication projects report one-time labor costs ranging from several thousand dollars for small collections to well over $50,000 for archives exceeding one million files.
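To make the idea of perceptual hashing concrete, here is a minimal sketch of one common variant, the difference hash. It is not the method of any particular product. To stay self-contained it works on plain grids of grayscale values rather than real image files, which a real deployment would decode with an image library. All function names, the synthetic test image, and the match threshold of 5 bits are assumptions made for illustration.

```python
def shrink(pixels, width, height):
    """Downsample a 2D grayscale grid by averaging rectangular blocks."""
    src_h, src_w = len(pixels), len(pixels[0])
    small = []
    for y in range(height):
        y0 = y * src_h // height
        y1 = max((y + 1) * src_h // height, y0 + 1)
        row = []
        for x in range(width):
            x0 = x * src_w // width
            x1 = max((x + 1) * src_w // width, x0 + 1)
            block = [pixels[j][i] for j in range(y0, y1) for i in range(x0, x1)]
            row.append(sum(block) / len(block))
        small.append(row)
    return small


def dhash(pixels, hash_size=8):
    """Difference hash: one bit per horizontally adjacent pair of cells."""
    small = shrink(pixels, hash_size + 1, hash_size)
    bits = 0
    for row in small:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | int(left > right)
    return bits


def hamming(a, b):
    """Count the bits that differ between two hashes."""
    return bin(a ^ b).count("1")


def upscale(pattern, factor):
    """Demo helper: enlarge a grid by repeating each cell factor x factor times."""
    return [[v for v in row for _ in range(factor)]
            for row in pattern for _ in range(factor)]


# Synthetic stand-in for one photograph scanned at two resolutions.
pattern = [[(x * 37 + y * 101) % 256 for x in range(9)] for y in range(8)]
scan_hi = upscale(pattern, 16)   # 144 x 128 "high-resolution" scan
scan_lo = upscale(pattern, 4)    # 36 x 32 "low-resolution" scan
negative = [[255 - v for v in row] for row in scan_hi]  # a different image

THRESHOLD = 5  # assumed cutoff: at most 5 differing bits counts as a near-match
print(hamming(dhash(scan_hi), dhash(scan_lo)) <= THRESHOLD)   # True
print(hamming(dhash(scan_hi), dhash(negative)) <= THRESHOLD)  # False
```

Unlike a byte-level checksum, the two resolutions of the same picture hash to the same value here, which is the property that lets such tools catch re-scans and resized copies. On real photographs the match is rarely exact, so the threshold has to be tuned against a sample of the collection.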

For Boston's archivists and IT managers, the practical next step is a collection audit before any new digitization contracts are signed. Establishing a deduplication checkpoint at the point of ingest — rather than cleaning up after the fact — is the approach now recommended in the most recent edition of the Digital Public Library of America's metadata application profile, updated in late 2024. The math, at least, is on the side of acting early.

About this article

Published by The Daily Boston

This article was produced by The Daily Boston editorial desk and covers news in Boston. See our editorial standards for how we use AI.
