Boston's publicly accessible digital archives have a problem that has quietly frustrated researchers, librarians, and web developers for years: thousands of duplicate images clogging databases, slowing search tools, and inflating storage costs. This week, two of the city's major cultural institutions confirmed they are deepening efforts to fix it.
The push matters right now because a July 1 deadline tied to a federally backed digitization grant — administered through the Institute of Museum and Library Services — required participating institutions to submit updated metadata and de-duplication reports. For Boston, that means the Boston Public Library's Digital Commonwealth program and the Museum of Fine Arts both face scrutiny over how they manage image redundancy in collections that are freely accessible to the public.
What Happened This Week
The Boston Public Library on Boylston Street, whose Digital Commonwealth platform hosts more than 1.5 million digitized items from institutions across Massachusetts, has been running a software audit since early June. The audit uses hash-matching algorithms to flag identical or near-identical image files across partner collections. Librarians at the BPL's Central Branch confirmed the process is ongoing, though the library has not released final numbers. Digital Commonwealth serves roughly 200 partner organizations statewide, from Framingham to Provincetown, which means a single photograph can arrive in the database multiple times through separate institutional uploads.
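For readers curious what hash matching looks like in practice, here is a minimal sketch in Python. It is not the BPL's actual tooling, which the library has not described. The function names and directory layout are hypothetical, and it catches only byte-for-byte identical files. Near-identical images, such as a resized or re-compressed scan, would need a perceptual hash instead.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_exact_duplicates(root):
    """Group every file under `root` by content hash.

    Returns a list of groups; each group is a list of paths whose
    bytes are identical. Files with no twin are left out.
    """
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            groups[file_digest(path)].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]
```

Because the comparison is on file contents rather than file names, the same photograph uploaded by two partner institutions under different names would land in the same group.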
At the Museum of Fine Arts on Huntington Avenue, the issue has a different shape. The MFA's open-access image portal — which the museum expanded in 2020 to include tens of thousands of works in the public domain — has accumulated duplicate entries partly because of batch uploads done during the pandemic, when staff were working remotely and cross-referencing was harder. The MFA has not disclosed the precise number of duplicate records, but similar institutions that completed comparable audits have reported redundancy rates of between 8 and 15 percent in large unmanaged digital collections, according to published findings from the Digital Public Library of America.
The practical stakes are real. Storage costs for cloud-hosted image libraries have climbed sharply. Amazon Web Services S3 Standard storage, a common solution for mid-size cultural institutions, runs roughly $0.023 per gigabyte per month. That is modest on its face, but significant when an archive holds hundreds of terabytes and a meaningful slice of those files are exact copies. For institutions operating on tight municipal or endowment budgets, trimming duplicate load translates directly into dollars that can go toward new acquisitions or public programming.
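To put rough numbers on that, the short calculation below applies the quoted $0.023 rate to a hypothetical 200-terabyte archive at the 8 and 15 percent redundancy rates cited above. The archive size is an assumption chosen for illustration, not a figure from either institution. The sketch also treats one terabyte as 1,000 gigabytes and ignores tiered pricing, request fees, and backups.

```python
PRICE_PER_GB_MONTH = 0.023  # per-gigabyte monthly rate quoted in the article


def monthly_duplicate_cost(archive_tb, duplicate_rate, price=PRICE_PER_GB_MONTH):
    """Estimate the monthly cost of storing duplicate files.

    archive_tb     -- total archive size in terabytes (hypothetical input)
    duplicate_rate -- fraction of stored bytes that are duplicates, e.g. 0.08
    """
    archive_gb = archive_tb * 1000  # simplifying assumption: 1 TB = 1,000 GB
    return archive_gb * duplicate_rate * price


for rate in (0.08, 0.15):
    monthly = monthly_duplicate_cost(200, rate)
    print(f"{rate:.0%} duplicates: ${monthly:,.0f}/month, ${monthly * 12:,.0f}/year")
```

Under those assumptions the duplicates alone would cost about $368 to $690 a month, or roughly $4,400 to $8,300 a year.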
What Researchers and Developers Are Saying
The problem is not purely administrative. Graduate students and independent developers who pull data from Digital Commonwealth and the MFA's API for research projects at Northeastern University and MIT have long flagged the issue in developer forums. Duplicate image records return misleading results in searches, skew metadata analysis, and break automated workflows that rely on unique identifiers. Northeastern's NULab for Texts, Maps, and Networks, based in Holmes Hall on Huntington Avenue, has documented the downstream effects of messy image metadata in at least two published research papers on digital humanities methodology.
The timing also intersects with Mayor Michelle Wu's broader push to expand digital access as part of Boston's technology and innovation agenda. The city's Office of New Urban Mechanics has encouraged cultural partners to treat data quality as a prerequisite for any new public-facing digital tool. Messy archives undermine that goal before a single app is built.
For institutions still mid-audit, the path forward involves three steps: completing the hash-matching sweep, manually reviewing flagged records where automated tools are uncertain, and establishing upload protocols that prevent future duplication. The BPL's Digital Commonwealth team has indicated it expects to publish updated collection statistics by September. The MFA has not announced a public timeline.
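Those three steps can be sketched as a single upload check. The example below is illustrative only, since neither institution has published its protocol. It assumes each record already carries an exact content hash and a 64-bit perceptual hash stored as an integer; the function names and the review threshold of 5 differing bits are hypothetical.

```python
def hamming(a, b):
    """Count the bits that differ between two integer perceptual hashes."""
    return bin(a ^ b).count("1")


def triage(incoming, catalog, review_threshold=5):
    """Decide what to do with an incoming image before it is stored.

    incoming -- (content_hash, perceptual_hash) for the new file
    catalog  -- dict of record_id -> (content_hash, perceptual_hash)

    Returns ("duplicate", id) for an exact match, ("review", id) when a
    near match needs a human decision, or ("new", None) otherwise.
    """
    content_hash, perceptual_hash = incoming
    # Step 1: exact matches are safe to reject automatically.
    for record_id, (known_hash, _) in catalog.items():
        if known_hash == content_hash:
            return ("duplicate", record_id)
    # Step 2: near matches are uncertain, so route them to manual review.
    for record_id, (_, known_phash) in catalog.items():
        if hamming(known_phash, perceptual_hash) <= review_threshold:
            return ("review", record_id)
    # Step 3: nothing similar on file, so the upload can proceed.
    return ("new", None)
```

Running the check at upload time, rather than only during periodic audits, is what keeps new duplicates from accumulating once the cleanup is finished.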
Researchers who rely on either platform should check collection update logs over the coming weeks — some records may temporarily disappear or be merged, which can affect saved links and citation trails. Anyone working on a time-sensitive project that draws from either archive would be wise to download local copies of the specific images they need before the cleanup concludes.