If you’ve ever looked up a vintage newspaper article for research, you may not have appreciated the amount of effort it took to make that old story available.
On Monday the Boston Public Library sponsored an online event, “Beyond the Headlines: AI and Historical Newspapers,” that shed light on the complexities of news preservation. The two featured speakers — Molly Hardy, project lead for the Public Data Project at Harvard Law School’s Library Innovation Lab, and Greg Leppert, executive director of the Institutional Data Initiative at the Harvard Law School Library — focused on the challenges of preserving historical newspapers, and of using AI to restore them digitally.
After decades of working with old newspapers, from reading rooms to computer screens, Hardy said she’s learned one defining fact about them: They’re messy. “Messy to handle, messy to catalogue, messy to read, messy to digitize, messy to OCR [referring to Optical Character Recognition technology], messy to word search, messy to appreciate.”
Historical newspapers, she noted, could be short-lived, quickly discarded, and used for other purposes, “like wrapping up dead fish.” Yet newspapers distinguish themselves, she said, “because they are so crucial to our understanding of the past while simultaneously being one of the most difficult sources to preserve and to make accessible.”
The effort to list and collect newspapers, she said, is “almost as old as the nation itself.” She noted an advertisement in The Massachusetts Spy from 1810, soliciting newspapers for what would become the American Antiquarian Society in Worcester. Two centuries later, the society completed a directory of American newspapers published before 1820, but “things get iffy” after that, thanks to the rise of industrial print, which sharply increased the number of newspapers.
“Not only do we not know every newspaper that came off the presses in this country, we also have an incomplete collection of those that do survive.” The best resource, she said, is the Directory of U.S. Newspapers, produced by the Library of Congress and the National Endowment for the Humanities.
A further example of “messiness” was the unequal representation of ethnic and racial communities in the newspapers that have been preserved. Large digital databases, notably the Library of Congress’s Chronicling America, have tended to favor long-running newspapers of record, while the ethnic press was often produced rapidly and briefly, due to “all sorts of mitigating circumstances.” As a result, the majority press took priority. This changed in 2022, she said, when Chronicling America made a concerted effort to prioritize “underrepresented” materials, moving away from its emphasis on long-running titles.
Finally, she noted, changes in colloquial language over time can frustrate efforts to get a clear picture through keyword searches. “A researcher must therefore familiarize themselves with the terminology of the period in which they are studying, before turning to the keyword search as if it was an oracle waiting to reveal the truth of the past.”
As part of the Institutional Data Initiative at Harvard Law, Leppert helped oversee the release of a collection of a million public domain books that were scanned at Harvard Library as part of a Google Books project.
A project digitizing a million newspaper pages is underway in collaboration with the Boston Public Library. Exploring the complexity of that process, he said, “Data is feeding into AI, and AI is feeding into data — this could either look like the serpent eating itself or like a virtuous cycle. … We are trying to get involved and make sure that this is a virtuous cycle, this actually lifts the work of everybody who’s working at knowledge institutions.”
He challenged the preconception that you could simply “shove the newspapers into an AI model” and come out with accurate scans. The reality is far more complicated.
As an example, he cited a 2023 Harvard project, “American Stories,” that used an object-detection AI model known as YOLO (for “you only look once”) to scan historical newspapers, breaking each page up into individual content blocks. The process has since been refined and now involves dividing pages into component parts for scanning, then classifying the content afterward.
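For readers curious what that layout-detection step might look like in code, here is a minimal Python sketch. It assumes the open-source ultralytics package, and “layout_weights.pt” is a hypothetical checkpoint fine-tuned on newspaper layouts; the American Stories team trained its own models, so this illustrates the general technique rather than that project’s actual pipeline.

```python
# Minimal sketch of YOLO-style layout detection on a scanned newspaper page.
# Assumes the open-source "ultralytics" package; "layout_weights.pt" is a
# hypothetical checkpoint fine-tuned on newspaper layouts, not the American
# Stories project's actual model.
from ultralytics import YOLO

model = YOLO("layout_weights.pt")   # hypothetical newspaper-layout weights
results = model("page_scan.png")    # detect content regions on one page image

# Each detected box is a candidate content block (headline, article, ad);
# downstream steps would crop, OCR, and classify each region separately.
for box in results[0].boxes:
    x1, y1, x2, y2 = box.xyxy[0].tolist()   # region corners in pixels
    label = results[0].names[int(box.cls)]  # class name, e.g. "article"
    print(f"{label}: ({x1:.0f}, {y1:.0f}) to ({x2:.0f}, {y2:.0f})")
```

Cropping each detected region, running OCR on it, and classifying the result afterward is the refinement Leppert described.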
Likewise, advances in OCR have made it possible to restore text from original newspapers more accurately. In fact, he said, it sometimes does the job too well, correcting errors such as missing spaces that occurred in the original copy. “Is that helpful to us? Actually it’s probably not. If you’re a digital humanist who wants to study the evolution of language, you want to know when someone forgot that space. … That’s how language evolves, and we need to be aware of that as we use these tools.”
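One way to use these tools without erasing that evidence is to log every divergence between the raw OCR transcription and the “corrected” text. The following sketch uses only Python’s standard-library difflib; the sample strings are invented for illustration.

```python
# Minimal sketch of auditing OCR auto-corrections so that historically
# meaningful quirks (a missing space, an archaic spelling) are flagged
# rather than silently normalized. Standard library only; the sample
# strings are invented for illustration.
import difflib

raw_ocr   = "The steamship arrivedyesterday from Liverpool."
corrected = "The steamship arrived yesterday from Liverpool."

# Record every place the correction step changed the raw transcription,
# so a researcher can judge whether the original form was a scanning
# artifact or genuine period usage.
matcher = difflib.SequenceMatcher(None, raw_ocr, corrected)
for op, i1, i2, j1, j2 in matcher.get_opcodes():
    if op != "equal":
        print(f"{op}: {raw_ocr[i1:i2]!r} -> {corrected[j1:j2]!r}")
```

Keeping both versions, rather than overwriting the raw output, lets a digital humanist decide later which “errors” were actually evidence.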
Responding to an audience question about National Endowment for the Humanities funding for such efforts, moderator Jessica Chapel of the Boston Public Library noted that the endowment had its funding restored after federal cuts last spring.
Hardy, a former senior officer at the endowment, expressed optimism that this work would go on. “One of the many great things about newspapers is that they’re everything to everyone — their content, where they came from, they’re so capacious, and that goes back to the messiness. There’s something for everybody in large newspaper aggregation sets, so I am cautiously optimistic that this work will be able to continue.”