The Daily Boston

Boston news, every day


Boston's Digital Archives Are Full of Duplicate Images. Officials and Experts Say That's a Bigger Problem Than It Sounds.

From City Hall records to university research databases, Boston institutions are grappling with what to do when the same photograph appears twice — or a hundred times.

By Boston News Desk · Published 4 July 2026, 2:43 pm


Photo: Abdullah Almutairi on Pexels

The Boston City Archives holds tens of thousands of images. Some of them exist in triplicate. A growing number of technologists, municipal records managers, and university librarians are pushing Boston institutions to treat duplicate image replacement — the systematic process of identifying and retiring redundant digital files — not as a housekeeping chore but as a governance issue with real cost consequences.

The timing matters. Mayor Michelle Wu's administration has been expanding digital service delivery across city departments, a push that has accelerated data storage demands at the same time that the MBTA, Boston Public Schools, and several Roxbury-based community organizations have moved records management online. Each migration tends to compound existing duplication problems rather than solve them.

Why Redundant Files Become Expensive Fast

Storage is not free. According to Gartner's 2025 infrastructure cost benchmarks, redundant and obsolete data typically accounts for between 30 and 40 percent of an organization's total stored data volume, a figure that translates directly into ongoing cloud and server costs. For a city agency or a research university running petabyte-scale storage, a 30 percent share works out to roughly 300 terabytes of avoidable data for every petabyte kept. Boston University's Mugar Memorial Library and Northeastern University's Digital Repository Service, both of which manage large photographic and document collections, have each invested in deduplication tooling in recent years, though neither institution has publicly disclosed the scale of its cleanup effort.

The problem is not unique to government. Harvard Medical School's Countway Library, which sits in the Longwood Medical Area and serves multiple affiliated hospitals, manages research imagery ranging from pathology slides to archival photographs. Duplicate image files in medical research databases carry a specific risk: if two versions of the same image circulate with slightly different metadata, a researcher pulling files programmatically may treat them as distinct data points and skew results. That concern has been raised in peer-reviewed literature on research data integrity, though no Boston institution has been publicly implicated in a documented error of that kind.

Rebecca Tiven, who manages records policy for the City of Boston's Department of Innovation and Technology — known locally as DoIT — has spoken at public panels about the city's data hygiene priorities, though her department has not issued a formal public report on image duplication specifically. DoIT's budget for fiscal year 2026 included line items for cloud infrastructure and data governance, but the city has not broken out deduplication as a standalone program cost in documents reviewed by The Daily Boston.

What the Experts Are Actually Recommending

Specialists in digital asset management say the core issue is workflow, not storage technology. The usual recommendation is a three-step approach. First, audit existing repositories to establish a baseline count of duplicate files. Second, implement perceptual hashing, a technique that fingerprints what an image looks like rather than the bytes in the file, so it can match visually identical or near-identical images even when file names, formats, or resolutions differ. Third, establish a clear policy on which version of a duplicated file becomes the canonical record before the others are retired.
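For readers who want to see what perceptual hashing looks like in practice, here is a minimal sketch of one simple variant, often called an average hash. It is an illustration under stated assumptions, not a description of any tool a Boston institution uses: the image is assumed to be already decoded into a grid of grayscale values from 0 to 255, and the function names (`shrink`, `average_hash`, `hamming_distance`) are hypothetical. Production systems typically rely on an imaging library and a more robust hash.

```python
from typing import List

Grid = List[List[int]]  # grayscale pixels (0-255), one inner list per row


def shrink(pixels: Grid, size: int = 8) -> List[List[float]]:
    """Downscale to size x size by averaging rectangular blocks of pixels."""
    height, width = len(pixels), len(pixels[0])
    if height < size or width < size:
        raise ValueError("image must be at least size x size pixels")
    small = []
    for r in range(size):
        r0, r1 = r * height // size, (r + 1) * height // size
        row = []
        for c in range(size):
            c0, c1 = c * width // size, (c + 1) * width // size
            block = [pixels[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            row.append(sum(block) / len(block))
        small.append(row)
    return small


def average_hash(pixels: Grid, size: int = 8) -> int:
    """Return a size*size-bit fingerprint: 1 where a cell is brighter than the mean."""
    cells = [value for row in shrink(pixels, size) for value in row]
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Count the bits that differ between two fingerprints."""
    return bin(a ^ b).count("1")


# Synthetic test images (assumptions for illustration, not archive data).
original = [[x * 8 for x in range(32)] for _ in range(32)]           # left-to-right gradient
brighter = [[min(255, p + 10) for p in row] for row in original]     # same picture, re-exposed
half_size = [row[::2] for row in original[::2]]                      # same picture, 16 x 16
unrelated = [[y * 8 for _ in range(32)] for y in range(32)]          # top-to-bottom gradient

base = average_hash(original)
print(hamming_distance(base, average_hash(brighter)))
print(hamming_distance(base, average_hash(half_size)))
print(hamming_distance(base, average_hash(unrelated)))
```

A small distance between two fingerprints suggests the same picture; a large one suggests different pictures. Where to draw the line is a policy choice that each institution would need to test against its own collection, which is part of why the canonical-record question in the third step cannot be automated away.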

At the Boston Public Library's Norman B. Leventhal Map and Education Center on Boylston Street, staff have worked through similar questions in digitizing historical map collections. The center uses controlled vocabularies and consistent file-naming conventions to prevent duplication at the point of ingest, a model that records managers elsewhere in the city have pointed to as a practical template.

The Jamaica Plain branch of the BPL, which anchors a neighborhood where the Wu administration has been prioritizing housing production along Centre Street, also serves as a community digitization hub. Residents bring in personal photographs and documents to be scanned. Without a deduplication protocol at the back end, community collection projects like that one can quietly accumulate redundant files over months.

For city agencies and nonprofits looking to get ahead of the problem before the next budget cycle — fiscal year 2027 planning in Boston begins in earnest this fall — the practical first step recommended by digital archivists is straightforward: run a file count, sort by file size and creation date, and flag anything with an identical checksum. That costs nothing except staff time. The harder part, as anyone who has ever tried to get two city departments to agree on a naming convention will tell you, is the policy that follows.
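That first pass can be sketched in a few lines of Python using only the standard library. This is one possible approach, not a method attributed to any archivist quoted here. The function names are hypothetical, SHA-256 is an assumed choice of checksum, and modification time stands in for creation date because true creation time is not recorded the same way on every operating system.

```python
import hashlib
import os
from collections import defaultdict
from typing import Dict, List, Tuple


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Checksum a file in chunks so large scans do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(root: str) -> Tuple[int, Dict[str, List[str]]]:
    """Return (total file count, {checksum: [paths]}) for byte-identical files."""
    by_size = defaultdict(list)
    file_count = 0
    for folder, _subfolders, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            if os.path.isfile(path):
                file_count += 1
                by_size[os.path.getsize(path)].append(path)

    by_checksum = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a file with a unique size cannot have an identical twin
        for path in paths:
            by_checksum[sha256_of(path)].append(path)

    # Keep only checksums shared by two or more files, oldest modification first.
    duplicates = {
        checksum: sorted(paths, key=lambda p: (os.path.getmtime(p), p))
        for checksum, paths in by_checksum.items()
        if len(paths) > 1
    }
    return file_count, duplicates


if __name__ == "__main__":
    # "scans" is a placeholder folder name; point this at a real directory.
    total, groups = find_exact_duplicates("scans")
    print(f"{total} files scanned, {len(groups)} duplicate groups flagged")
    for checksum, paths in groups.items():
        print(checksum[:12], paths)
```

Grouping by file size before hashing avoids reading files that cannot possibly match. The script only flags candidates and deletes nothing; deciding which copy survives is the policy question.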


About this article

Published by The Daily Boston

This article was produced by The Daily Boston editorial desk and covers news in Boston. See our editorial standards for how we use AI.
