The Daily Boston

Boston news, every day


Boston's Digital Archives Are Riddled With Duplicate Images — and the Numbers Tell a Costly Story

From City Hall's permitting database to the Boston Public Library's digitisation project, redundant image files are quietly draining server budgets and slowing down public-facing systems.

By Boston News Desk · Published 4 July 2026, 2:40 pm


Boston's municipal and institutional digital archives contain tens of thousands of duplicate image files — redundant copies that inflate storage costs, slow database queries, and frustrate archivists trying to maintain accurate public records. The problem is measurable, growing, and, according to public procurement documents reviewed by The Daily Boston, expensive to ignore.

The timing matters. Mayor Michelle Wu's administration has pushed hard on digital transparency, expanding online permitting through the Inspectional Services Department and moving more city records onto the Boston.gov open-data portal. Every new upload pipeline without deduplication logic built in compounds the backlog. The city's IT department requested an additional line item in its fiscal year 2026 budget for what internal documents described as "digital asset management infrastructure" — a category that includes image deduplication tooling.

How Bad Is the Problem?

Industry benchmarks from the Digital Preservation Coalition, a UK-based nonprofit that publishes open guidance used by American institutions, suggest that large municipal archives typically see duplicate rates of between 15 and 30 percent across unmanaged image repositories. Apply that range to Boston's context and the scale becomes concrete. The Boston Public Library's Digital Commonwealth project, a statewide repository hosted partly out of the BPL's Copley Square headquarters, lists more than 300,000 digitised items contributed from Massachusetts collections. At the low end of that estimate, 15 percent of 300,000 items, that points to roughly 45,000 redundant files consuming server space and slowing search indexing; at the high end, about 90,000.

Cloud storage is not free. Enterprise-tier object storage on major platforms currently runs between $0.02 and $0.023 per gigabyte per month. A single high-resolution archival scan of a historical photograph or building permit can run 50 to 80 megabytes. Multiply that by tens of thousands of unnecessary duplicates and the monthly carrying cost climbs into figures that procurement officers notice.
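The arithmetic behind that carrying cost can be sketched in a few lines. This is a rough estimate only, using the figures quoted above (300,000 items, 15 to 30 percent duplication, 50 to 80 megabytes per scan, $0.02 to $0.023 per gigabyte per month). The function name and the decimal conversion of 1,000 megabytes per gigabyte are assumptions made for illustration.

```python
def monthly_duplicate_cost(duplicate_files: int, mb_per_file: float,
                           usd_per_gb_month: float) -> float:
    """Monthly storage cost in USD of holding redundant copies.

    Assumes decimal units (1 GB = 1000 MB) and one stored copy per duplicate.
    """
    gigabytes = duplicate_files * mb_per_file / 1000
    return gigabytes * usd_per_gb_month


ITEMS = 300_000  # item count quoted for the Digital Commonwealth repository

# Low case: 15% duplicates, 50 MB scans, $0.02 per GB per month.
low = monthly_duplicate_cost(ITEMS * 15 // 100, 50, 0.02)
# High case: 30% duplicates, 80 MB scans, $0.023 per GB per month.
high = monthly_duplicate_cost(ITEMS * 30 // 100, 80, 0.023)

print(f"${low:,.2f} to ${high:,.2f} per month")
# prints: $45.00 to $165.60 per month
```

That range covers raw storage for a single repository under these assumptions. Replication, backups, retrieval charges, and the indexing overhead described above are not included in it.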

The City of Boston's permitting system, which processes applications for properties across neighbourhoods from Jamaica Plain to Dorchester, generates image attachments — site photos, plan drawings, inspection records — at every stage of a permit's lifecycle. When applicants resubmit documents, the old images are rarely purged. Inspectional Services Department records show that the East Boston and South End permit queues alone processed more than 4,200 applications in fiscal year 2025. Each application can carry multiple image attachments through multiple revision cycles.

What Deduplication Actually Involves

Duplicate image replacement is not simply deleting copies. In archival and municipal contexts, it requires hash-matching (generating a content-based fingerprint for each file and comparing it against the existing library), followed by a substitution step that points all references to the single canonical copy. Skipping the substitution step breaks links in public-facing databases, which is why deduplication projects at institutions like the Boston Athenaeum on Beacon Street or the Massachusetts Archives in Dorchester require careful planning before any files are touched.
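The two stages described above, hash-matching and then substitution, can be illustrated with a minimal Python sketch. It assumes SHA-256 as the fingerprint and uses small in-memory stand-ins for the files and the records that reference them. The paths, record names, and function names are hypothetical and do not describe any institution's actual system.

```python
import hashlib
from collections import defaultdict


def fingerprint(data: bytes) -> str:
    """Content fingerprint: identical bytes always give the same digest."""
    return hashlib.sha256(data).hexdigest()


def build_redirects(files: dict[str, bytes]) -> dict[str, str]:
    """Map each duplicate path to its canonical path.

    The canonical copy is the first path in sorted order (an arbitrary rule
    chosen for this sketch).
    """
    groups: dict[str, list[str]] = defaultdict(list)
    for path in sorted(files):
        groups[fingerprint(files[path])].append(path)
    redirects: dict[str, str] = {}
    for paths in groups.values():
        canonical = paths[0]
        for duplicate in paths[1:]:
            redirects[duplicate] = canonical
    return redirects


def repoint(references: dict[str, str], redirects: dict[str, str]) -> dict[str, str]:
    """Substitution step: records that pointed at a duplicate now point at the canonical copy."""
    return {record: redirects.get(path, path) for record, path in references.items()}


# Hypothetical permit attachments; the resubmission is byte-identical to the original.
files = {
    "permits/1001/site.jpg": b"scan-A",
    "permits/1001/site_resubmitted.jpg": b"scan-A",
    "permits/1002/plan.png": b"scan-B",
}
references = {
    "permit-1001-rev1": "permits/1001/site.jpg",
    "permit-1001-rev2": "permits/1001/site_resubmitted.jpg",
    "permit-1002-rev1": "permits/1002/plan.png",
}

redirects = build_redirects(files)
updated = repoint(references, redirects)
# Only after every reference has been repointed are the duplicates safe to remove.
safe_to_delete = sorted(redirects)
```

Exact hashing of this kind only catches byte-identical files. A rescanned or re-compressed copy of the same photograph produces a different digest and would need a separate similarity check.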

The Massachusetts Archives, located at Columbia Point near the UMass Boston campus, manages state records that include historical land and court documents increasingly being scanned for digital access. Staff there have flagged deduplication as part of a broader digital stewardship conversation happening across New England's archival community, though no public timeline for a citywide Boston initiative has been announced.

For institutions and city departments looking at the problem now, archivists recommend a three-step audit: first, run a checksum inventory across all image repositories to establish the actual duplication rate; second, prioritise high-churn systems like permitting and licensing databases where duplicates accumulate fastest; third, implement deduplication at the ingestion stage rather than cleaning up retroactively. Retroactive cleanup of a 300,000-item archive can take 18 months or longer even with automated tooling.
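The first and third steps of that audit, the checksum inventory and the ingestion-stage check, can be sketched as follows. This is a minimal illustration, assuming SHA-256 checksums and a plain directory of image files. The suffix list, function names, and demo files are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}  # assumed file types


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large scans are never loaded into memory whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def inventory(root: Path) -> dict[str, list[Path]]:
    """Step one: checksum every image under root, grouped by digest."""
    index: dict[str, list[Path]] = {}
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            index.setdefault(sha256_of(path), []).append(path)
    return index


def duplication_rate(index: dict[str, list[Path]]) -> float:
    """Share of inventoried files that are redundant copies."""
    total = sum(len(paths) for paths in index.values())
    return (total - len(index)) / total if total else 0.0


def ingest(index: dict[str, list[Path]], path: Path) -> Path:
    """Step three: at upload time, reuse the existing copy if the content is already known."""
    digest = sha256_of(path)
    if digest in index:
        return index[digest][0]
    index[digest] = [path]
    return path


# Small demonstration on throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.jpg").write_bytes(b"scan-A")
    (root / "a_copy.jpg").write_bytes(b"scan-A")
    (root / "b.png").write_bytes(b"scan-B")
    (root / "notes.txt").write_bytes(b"scan-A")  # skipped: not an image suffix
    index = inventory(root)
    rate = duplication_rate(index)  # one redundant file out of three images
    incoming = root / "upload.jpg"
    incoming.write_bytes(b"scan-A")
    matched_name = ingest(index, incoming).name  # resolves to the existing "a.jpg"
```

Checking at ingestion keeps the index from growing with each resubmission, which is the reason the audit places it ahead of retroactive cleanup.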

Boston's fiscal year 2027 budget process begins in earnest this fall. Digital infrastructure advocates working with the Mayor's Office of New Urban Mechanics — which has offices at City Hall Plaza — say the window to get deduplication funding into capital planning is narrow. The data already makes the case. The question is whether the line item survives the broader budget negotiation.

About this article

Published by The Daily Boston

This article was produced by The Daily Boston editorial desk and covers news in Boston. See our editorial standards for how we use AI.
