Boston's push to digitize decades of paper records has produced an unexpected headache: thousands of duplicate images clogging the city's digital archive systems, slowing public records requests and straining the storage infrastructure that underpins everything from permit lookups in Dorchester to deed searches near Faneuil Hall. The problem, which has grown alongside accelerating scanning efforts at City Hall and the Boston City Archives in West Roxbury, is now forcing a set of decisions that archivists and city technology officials can no longer defer.
The timing is not accidental. Mayor Michelle Wu's administration has pushed hard on open-data and transparency commitments since 2022, and the digitization effort was meant to be a flagship deliverable. But bulk scanning operations—particularly those tied to housing records in Jamaica Plain and building permits in South Boston—routinely produce multiple image files for the same document when scanners malfunction, operators re-run batches, or format conversions generate secondary copies. The result is an archive that is nominally comprehensive but practically difficult to search.
What the Backlog Actually Looks Like
The Boston City Archives holds physical and digital records dating to the colonial era, with the bulk of recent digitization concentrated on post-1980 municipal documents. Industry estimates in records management suggest that duplicate image rates in large-scale municipal scanning projects can run as high as 15 to 20 percent of total files before deduplication software is applied. Projected against Boston's known scanning volumes, that range would imply tens of thousands of redundant files across the system. No official public count of Boston's specific duplicate inventory has been released as of this writing.
The Massachusetts Secretary of the Commonwealth's office, which oversees public records law statewide, sets a 10-business-day response window for records requests under Chapter 66 of the General Laws. Archivists and records managers working with large, duplicate-heavy databases consistently report that retrieval times climb when staff must manually verify which copy of a document is the authoritative one. For residents trying to pull building inspection records on Blue Hill Avenue or zoning decisions affecting Egleston Square, that delay is not abstract.
The Boston Public Library's Norman B. Leventhal Map Center, which has separately managed its own high-profile digitization projects, completed a deduplication and metadata audit of roughly 10,000 map images in 2023, a project that took about 14 months and required custom scripting alongside commercial software. That experience offers a rough benchmark for what City Hall would face with a larger, more legally sensitive document set.
The Decisions That Cannot Wait
Three choices sit at the center of what comes next. First, city technology staff must decide whether to pursue automated hash-matching deduplication, which is fast but can treat legally distinct records as a single document when their scans are identical, or manual audit workflows, which are slower but legally defensible for records that may end up in court proceedings. Second, officials must determine which department owns the problem: the city's Department of Innovation and Technology on City Hall Plaza, or the Archives in West Roxbury. Divided ownership has historically stalled projects of this kind in mid-size American cities. Third, and most consequentially, the administration must decide how to handle the period between now and full remediation: specifically, whether public records responses during that window will flag known duplication issues or simply deliver the file as found.
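For readers curious what hash-matching involves in practice, the following is a minimal sketch in Python, not a description of any tool the city uses. It assumes the scans sit in an ordinary folder tree; the function names `sha256_of_file` and `find_duplicate_groups` are hypothetical and chosen for illustration. The sketch groups files whose bytes are identical and reports the groups for human review rather than deleting anything.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_groups(root: Path) -> list[list[Path]]:
    """Group files under `root` whose contents are byte-for-byte identical.

    Hypothetical helper: it only reports candidate duplicates, leaving the
    decision about which copy is authoritative to a reviewer.
    """
    by_digest: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            by_digest[sha256_of_file(path)].append(path)
    # Keep only digests shared by more than one file.
    return [paths for paths in by_digest.values() if len(paths) > 1]
```

An exact hash of this kind matches only byte-identical files. A batch that was re-scanned, or a copy produced by a format conversion, will almost always differ at the byte level and would go undetected, which is one reason automated matching and manual audit are often treated as complementary rather than as alternatives.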
Housing advocates in Roxbury and Dorchester have a direct stake in the outcome. Permit and inspection records are central to tenant-side litigation and code-enforcement complaints, and a database that produces duplicate or ambiguous returns undermines exactly the transparency the Wu administration has promoted. The MBTA's own parallel digitization of maintenance logs—relevant to the ongoing federal safety oversight still in effect as of mid-2026—illustrates how document integrity problems compound quickly when records feed into regulatory and legal processes.
A city request for proposals (RFP) for a broader digital records management overhaul was expected to move through the procurement process before the end of fiscal year 2026, which closed June 30. Whether that contract has been awarded, and what scope it covers for deduplication specifically, will be the clearest signal of how seriously the administration intends to treat the problem before it metastasizes further.