Food for (AI) thought and the library initiative improving AI’s digital diet

On June 12, 2025, nearly one million digitized public domain works were released for the purposes of improving the data used to train artificial intelligence, or AI. The collection, secured by the Institutional Data Initiative (IDI) within the Harvard Law School Library, includes works written in 254 unique languages and dates back as far as the 15th century. This massive release is all part of the new initiative’s effort to ensure AI has open access to a diet of better data.

Since the upload in June, the IDI’s institutional dataset has already claimed the top spot on AI data hub Hugging Face, where it has been independently downloaded more than 45,000 times.

Navigating the massive data release took months of collaboration with several large tech firms, including Microsoft, OpenAI, and Google. In the fierce AI marketplace, initiatives simultaneously supported by competing corporations are few and far between. However, Harvard Law’s public domain data release represents a rare opportunity for mutual support.

Assistant Dean for Library and Information Services at Harvard Law School Amanda Watson says the release of Harvard’s digitized collection is only the beginning. According to Watson, IDI has already begun collaborating with other knowledge institutions, like Boston Public Library, to help navigate the release of additional digitized collections.

In a conversation with Harvard Law Today, Watson discussed IDI’s mission and, with AI perched on the precipice of ubiquity, the importance of ample dialogue between experts in information and technology.

Harvard Law Today: What is the Institutional Data Initiative? What is its mission and how can it be achieved?

Amanda Watson: The Institutional Data Initiative, or “IDI,” is a collaboration that began here in the Library Innovation Lab within the Harvard Law School Library. Its general goal is to build community between the people, like librarians, working to preserve knowledge and the technologists deploying that knowledge in new, innovative ways through technology like AI. IDI works with libraries to publish their collections as data as a way of catalyzing this bridge-building. I have only been at Harvard for a year, and my appointment essentially coincided with the start of IDI; that is, it was already in the works and quickly became one of my top priorities. Early on, I had amazing conversations with JZ [Jonathan Zittrain, Director of the Berkman Klein Center for Internet & Society] and Greg [Leppert, Executive Director of IDI, and Chief Technologist of the Berkman Klein Center for Internet & Society] about AI’s role in librarianship and information. Personally, from the time I began contributing to IDI, my work has focused on coalition building and community development. Meeting with individuals from other knowledge institutions to compare notes and share our strategies has been extremely valuable, and will help ensure this initiative continues to succeed.

HLT: Why is this a critical moment for IDI?

Watson: One of the reasons it’s difficult to describe what IDI does in just a few words is because we are at a vanguard moment for the ubiquity and proliferation of AI in our daily lives. There is significant disparity today among those in law, academia, and knowledge institutions when it comes to their understanding of AI. Collectively, we all have very different perspectives about AI’s meaning in our lives, its potential uses, and the quality of the information it provides. In a way, IDI is an attempt to seize this moment to look deeply at information systems and AI technology to ensure these layers are communicating with one another at a very high level. Personally, I consider this part of IDI’s objective especially important and urgent because, at this stage, we still have an opportunity to ensure meaningful collaboration between the worlds of technology and information.

The other reason we’re committed to doing this work now is because we know commercial players are already approaching libraries for access to our collections, or simply scraping material from our websites and often overwhelming our systems. As we work on making our relationships with these companies more productive, we want to avoid a situation where it’s every library for itself in those conversations. If libraries can come together to assert our leadership and our terms for working with these companies, we have a much better chance of meeting the moment in a way that benefits everyone.

HLT: Why is it important to train AI on data from libraries and other knowledge institutions? Who benefits?

Watson: On a basic level, AI needs to consume information to be good at what it does. So, who better to tell it how to consume information than information professionals and scientists? The marriage of technology and information is the key to improving AI and maximizing its benefits to our world. A lot of people misconstrue that connection and think, “What does a library have to do with AI? AI is technology, libraries are for storing books.” But books are just one little sweet part of what libraries do.

At their core, libraries are custodians of information, and we have to acknowledge that we know very little about the information that has been used to train AI to this point. When using AI, we often don’t really know what we are looking at because even the developers don’t know where it came from. Developers have simply used what is available and, until now, much of that training data has not been from identifiable and traceable source materials. Courts have loosened restrictions on access to data that has already been digitized or digitally native data from sources like comment sections on the internet, but that same permissive attitude was not initially applied to the data housed in our libraries.

To this point, the quality of data available for AI training has been a mixed bag. So, IDI is focused on injecting a wide range of high-quality information into this process. By training on large swaths of open datasets from libraries, AI takes in better information, which also helps significantly improve the technology’s ability to differentiate between competing ideas. At the scale of data that libraries can provide, AI will not just simply receive better information to base its work off of — it will actually help AI function better.

HLT: Is retroactively vetting the questionable or low-quality data that AI has trained on up until this point worth pursuing? What about vetting data going forward?

Watson: To me, it’s less about vetting and more about openness. So long as there is transparency in the data, the world will have access to the insight and understanding to improve and recalibrate. Based on how we have learned to seek information in the last 20 years, we have collectively become less concerned with knowing the exact sources of our information. Even though there has been a dearth of transparency in the data AI has trained on to this point, I think there is room for optimism. People have begun to pay attention and are demanding more transparency. You will see developers like OpenAI provide sources for their data and allow users to track down where the information came from. I think it’s been interesting to watch people test AI about things they know deeply and confidently, and then test AI about things they do not know deeply and confidently. That signals to me that we are not opening the source up enough. At the moment, we still do not really know what we are looking at. Exposing underlying sources will be critical to ensuring transparency, which is an integral part of IDI’s mission.

HLT: In the context of the legal profession, with a constant flow of new case law and evolving precedent, is AI good enough to help? How should lawyers and law students think about AI?

Watson: I think we should view AI like a good colleague, like a next-door neighbor to your office. We also need to acknowledge that its capabilities are extremely difficult to define with certainty because it’s always improving. Have I seen AI perform well in scope and scale of law? Absolutely. Have I seen it perform poorly? Absolutely. From an ethical standpoint, I expect that we will still find that people want human interaction at the base of their legal representation. In that way, we can think of AI as a paralegal, a very sharp intern, or a great coworker who can help us comprehend issues, point us towards solutions, and improve the efficiency of our work. I see value in AI for the whole legal system — attorneys, firms, judges, etc. — and also, critically, the potential client. Apart from assisting with actual legal research, AI might be particularly useful for those who are deciding if they need a lawyer by assessing whether people with similar needs typically handle the work themselves or hire an attorney.

HLT: What are some of the biggest misconceptions about AI training and open access to data? When the media reports on AI, what are the most glaring omissions?

Watson: What I see missing in these stories is information. On a fundamental level, how we treat information will say a lot about the next 100 years of our society going forward. There is a tendency to view AI as this new, mysterious technology and use it as borderline clickbait. Disinformation is just a plague right now, it’s just so pervasive. So, I would like to counter the notion that libraries are antiquated brick-and-mortar institutions used to house old books. Libraries are portals to information and librarians are the custodians of information irrespective of the format used to memorialize it. Technology needs libraries just as much as we need them.

HLT: Competing tech companies Microsoft and OpenAI have both supported IDI’s mission. How difficult was it to get them both on board?

Watson: I have personally found working with the people from OpenAI very easy. They seem generous, receptive, hungry for knowledge, and very fair. They have committed significant resources to soliciting the opinions of librarians and knowledge institutions about information ethics. Thus far, they have been invaluable partners that deeply care and consistently put their money where their mouth is. They treat us like information experts, not book hoarders, and they listen to what we have to say without trying to dictate what information should look like. Their mutual support was not surprising because they benefit from IDI achieving its goals, as does our society. I think this is one of those rare circumstances where altruism and business interests are in lockstep. Better functioning AI is better for business, and better for the world. At IDI’s recent summit, it was fascinating to hear all these librarians and information technologists discuss big problems in the same room and strategize ways to overcome them. We will be working together to address questions about environmental impact, intellectual property, fairness to authors, and the future of publishing. It’s a privilege to be part of these important conversations.

HLT: What’s next for IDI?

Watson: IDI is going to continue publishing new datasets in collaboration with a growing list of institutions, and doing research that comments on the quality and ethics of AI training. This work will help drive critical conversations about the foundation that this innovative technology will be built upon. We want to be initiating these conversations for the benefit of other stewards of information as they pursue their own goals. Collaborating and drawing blueprints other institutions can utilize will continue to be at the center of what we do. We will also be pursuing digitization opportunities for the purposes of both access and preservation. We are working with amazing partners to help catalyze their efforts. Harvard Law School Library’s Nuremberg project is a fantastic example. As humans in modern society, we are perpetually overwhelmed. So, I am very pleased that such a major part of IDI’s mission is to initiate these conversations and pursue opportunities for collaboration that otherwise would be drastically underprioritized.
