The Daily Boston


Boston's Digital Archives Are Riddled With Duplicate Images — And the Numbers Tell a Costly Story

From City Hall Plaza to the Fenway neighborhood, institutions are quietly grappling with a data redundancy problem that is eating storage budgets and slowing public access to records.

By Boston News Desk · Published 4 July 2026, 2:48 pm


Photo: Phil Evenden on Pexels

Boston-area institutions collectively hold tens of millions of digitized images across municipal, university, and public library archives — and a significant share of those files are duplicates, sometimes stored three or four times across different servers. That redundancy is costing money, slowing search systems, and undermining the reliability of public records at a moment when the city is pushing hard on digital transparency.

The issue has moved from the back rooms of IT departments to budget conversations at City Hall, Northeastern University, and the Boston Public Library's Central Library in Copley Square. Archivists and records managers across the region say the explosion of high-resolution scanning projects, many funded through federal grants between 2021 and 2024, flooded servers with unvetted image files. Deduplication, the process of identifying redundant copies and replacing them with a single verified master, was rarely written into the original grant workflows.
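For readers curious what the simplest form of deduplication looks like in practice, here is a minimal sketch in Python. It is not any institution's actual workflow: the function names are hypothetical, and it only catches byte-for-byte identical files by comparing SHA-256 digests. Rescans or re-exports of the same image would not match.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large TIFFs never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(root: Path) -> dict[str, list[Path]]:
    """Group files under root by content hash; keep groups with 2+ files."""
    groups: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            groups[sha256_of_file(path)].append(path)
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}
```

Each returned group is a set of candidates for review. Choosing which copy becomes the verified master remains a human decision.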

What the Data Actually Shows

The scale of the problem is measurable. The Boston Public Library's Digital Commonwealth platform, which aggregates digitized collections from more than 160 Massachusetts institutions, crossed 1.5 million publicly accessible objects by early 2026. Internal assessments shared at a New England Archivists conference in March 2026 indicated that duplicate image rates in batch-scanned municipal collections can run as high as 18 percent. If that upper-bound rate held across a collection of that size, as many as 270,000 objects would be redundant versions of files stored elsewhere in the same system.
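The estimate above is simple multiplication, and it is sensitive to the rate chosen. The sketch below reproduces it in Python. Only the 18 percent rate and the 1.5 million object count come from the reporting; the 5 and 10 percent rows are illustrative assumptions added for comparison.

```python
def estimated_duplicates(total_objects: int, duplicate_rate: float) -> int:
    """Estimate redundant objects as total objects times the duplicate rate."""
    return round(total_objects * duplicate_rate)


TOTAL_OBJECTS = 1_500_000  # platform size cited in the article

# 0.18 is the reported upper bound; 0.05 and 0.10 are hypothetical scenarios.
for rate in (0.05, 0.10, 0.18):
    count = estimated_duplicates(TOTAL_OBJECTS, rate)
    print(f"{rate:.0%} duplicate rate -> about {count:,} redundant objects")
```

At the reported upper bound the function returns 270,000, matching the figure in the text.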

Storage costs for high-resolution TIFF files, the standard archival format, run between $0.02 and $0.05 per gigabyte per month on cloud infrastructure, which looks modest on its face. But a single uncompressed TIFF scan of a mid-20th-century city planning map can exceed 400 megabytes. Multiply that across tens of thousands of duplicated map and photograph files, and institutions are paying every month to store several terabytes of redundant data. Northeastern's Snell Library Special Collections on Huntington Avenue launched a deduplication audit in January 2026 targeting its Boston neighborhood photography holdings from the urban renewal era.
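To make the storage arithmetic concrete, here is a rough sketch in Python. The per-gigabyte prices and the 400-megabyte file size come from the figures above. The count of 20,000 duplicate files is a hypothetical stand-in for "tens of thousands", and decimal units (1 GB = 1,000 MB) are assumed.

```python
def monthly_storage_cost(file_count: int,
                         megabytes_per_file: float,
                         dollars_per_gb_month: float) -> float:
    """Monthly cost of storing file_count files of a given size."""
    gigabytes = file_count * megabytes_per_file / 1000  # decimal gigabytes
    return gigabytes * dollars_per_gb_month


DUPLICATE_FILES = 20_000  # hypothetical count, not a reported figure
FILE_SIZE_MB = 400        # large uncompressed TIFF, per the article

terabytes = DUPLICATE_FILES * FILE_SIZE_MB / 1_000_000
low = monthly_storage_cost(DUPLICATE_FILES, FILE_SIZE_MB, 0.02)
high = monthly_storage_cost(DUPLICATE_FILES, FILE_SIZE_MB, 0.05)
print(f"{terabytes:.0f} TB of duplicates: ${low:,.0f} to ${high:,.0f} per month")
```

Under those assumptions the redundant data comes to 8 terabytes, costing roughly $160 to $400 a month. The bill is small in any one month but recurs for as long as the duplicates remain.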

The MBTA's own records division, which maintains engineering drawings and inspection photographs for the entire transit network, including the Red Line from Alewife to Braintree, identified a duplicate image problem during a 2025 infrastructure records digitization push. The agency did not publicly release specific figures, but the effort to clean up the image database was folded into a broader records modernization contract awarded in the fall of 2025.

Why Replacement Is Harder Than It Sounds

Replacing a duplicate image isn't simply a matter of deleting extra files. Archivists must verify that the retained master copy meets quality standards (correct color profile, sufficient resolution, accurate metadata) before any secondary copy is removed. If a duplicate was tagged with different descriptive data than the master, that contextual information has to be merged, not discarded. The Jamaica Plain branch of the Boston Public Library, which holds a significant local history photograph collection covering the neighborhood's mid-century industrial waterfront along the Stony Brook corridor, ran a pilot deduplication project in late 2025 that took three staff months to process roughly 4,000 images.
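The merge step described above can be sketched in a few lines of Python. This is an illustration only, not the Jamaica Plain pilot's method: the function name and field names are hypothetical, and records are assumed to be simple field-to-text mappings. The master's values are never overwritten. Gaps are filled from the duplicate, and disagreements are set aside for an archivist to resolve.

```python
def merge_metadata(master: dict[str, str],
                   duplicate: dict[str, str]
                   ) -> tuple[dict[str, str], dict[str, tuple[str, str]]]:
    """Fill gaps in the master record and report conflicting fields."""
    merged = dict(master)
    conflicts: dict[str, tuple[str, str]] = {}
    for field, value in duplicate.items():
        if not value:
            continue  # the duplicate has nothing to contribute here
        if not merged.get(field):
            merged[field] = value  # master was blank, so keep the extra context
        elif merged[field] != value:
            conflicts[field] = (merged[field], value)  # needs human review
    return merged, conflicts


# Hypothetical records for illustration.
master = {"title": "Factory on Stony Brook", "date": "", "photographer": "Unknown"}
duplicate = {"title": "Factory on Stony Brook", "date": "1954", "photographer": "J. Doe"}
merged, conflicts = merge_metadata(master, duplicate)
print(merged)
print(conflicts)
```

Only after the conflicts list is empty, or each entry has been resolved by hand, would it be safe to remove the secondary copy.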

The cost in staff time is substantial. Professional archivists in the Boston market earn between $52,000 and $74,000 annually, according to regional salary surveys published by the Society of American Archivists. At those rates, three staff months of deduplication work represents roughly $13,000 to $18,500 in salary alone, before any software licensing for automated hash-matching tools.
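Labor estimates of this kind can be checked by prorating annual salary by project length and by the share of an archivist's time the work consumes. The Python sketch below does that for the salary range cited above. It counts salary only; benefits and overhead, which it omits, would push the totals higher. The function name is hypothetical.

```python
def project_labor_cost(annual_salary: float,
                       months: float,
                       time_fraction: float = 1.0) -> float:
    """Salary cost of a project: annual pay prorated by duration and share of time."""
    return annual_salary * (months / 12) * time_fraction


SALARY_RANGE = (52_000, 74_000)  # Boston-market range cited in the article

for fraction, label in ((1.0, "full time"), (0.5, "half time")):
    low, high = (project_labor_cost(s, 3, fraction) for s in SALARY_RANGE)
    print(f"Three months, {label}: ${low:,.0f} to ${high:,.0f}")
```

Three months of full-time work comes to $13,000 to $18,500. The same calendar period at half time costs half as much, $6,500 to $9,250.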

Records management consultants are advising institutions planning deduplication work in the second half of 2026 to build automated perceptual hashing, a technique that matches visually identical images even when file names, formats, or resolutions differ, into their scanning pipelines from the outset rather than treating it as a cleanup task. The Boston City Archives in West Roxbury, which manages permanent municipal records, is scheduled to release updated digitization standards in the third quarter of 2026 that are expected to address duplicate prevention requirements directly. Getting the infrastructure right now, before the next round of federal digitization grants opens, will determine whether the numbers improve or compound.
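Perceptual hashing comes in several variants. The sketch below shows one of the simplest, often called an average hash, in plain Python. It is an illustration rather than the tool any consultant or archive is known to use. It assumes each image has already been decoded into a grid of grayscale values at least 8 pixels on each side; a real pipeline would rely on an imaging library for that step.

```python
def downscale(pixels: list[list[int]], size: int = 8) -> list[list[float]]:
    """Shrink a grayscale grid to size x size by averaging blocks of pixels."""
    height, width = len(pixels), len(pixels[0])
    small = []
    for row in range(size):
        top, bottom = row * height // size, (row + 1) * height // size
        out_row = []
        for col in range(size):
            left, right = col * width // size, (col + 1) * width // size
            block = [pixels[y][x]
                     for y in range(top, bottom)
                     for x in range(left, right)]
            out_row.append(sum(block) / len(block))
        small.append(out_row)
    return small


def average_hash(pixels: list[list[int]], size: int = 8) -> int:
    """Build a fingerprint with one bit per cell: 1 if brighter than the mean."""
    flat = [value for row in downscale(pixels, size) for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Count the bits that differ between two fingerprints."""
    return bin(hash_a ^ hash_b).count("1")


# A synthetic 16x16 "scan" and the same picture at double resolution.
scan = [[x * 16 for x in range(16)] for _ in range(16)]
rescan = [[scan[y // 2][x // 2] for x in range(32)] for y in range(32)]
negative = [[255 - value for value in row] for row in scan]

print(hamming_distance(average_hash(scan), average_hash(rescan)))
print(hamming_distance(average_hash(scan), average_hash(negative)))
```

The rescan produces the same fingerprint as the original despite having four times as many pixels, while the inverted image differs in most bits. In practice a pipeline would flag pairs whose distance falls under some small threshold, and that cutoff is a tuning choice rather than a fixed standard.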


About this article

Published by The Daily Boston

This article was produced by The Daily Boston editorial desk and covers news in Boston. See our editorial standards for how we use AI.
