In less than a decade, artificial intelligence has evolved from a promising idea to a fully functioning engine driving changes in how people live and work across the globe. Engines, of course, need fuel, and the vast quantities of data used to train AI are powering these online innovations.

At the Institutional Data Initiative (IDI), a new program hosted within the Harvard Law School Library, efforts are already underway to expand and enhance the data resources available for AI training. At the initiative’s public launch on Dec. 12, Library Innovation Lab faculty director Jonathan Zittrain ’95 and IDI executive director Greg Leppert announced plans to broaden the availability of public domain data from knowledge institutions, including the text of nearly one million books scanned at Harvard Library, to train AI models.

“Libraries and other stewards of humanity’s aggregated knowledge can think in terms of centuries — preserving it and providing access both for known uses and for aims completely unanticipated,” said Zittrain, the George Bemis Professor of International Law at Harvard Law School and Vice Dean of the Harvard Law School Library.

“IDI’s aim is to address newly energized interest from those quarters in otherwise-obscure texts in ways that preserve institutions’ values. That means working towards access for all for public domain works that have remained fenced — access both for the human eye and for imaginative machine processing. The latter will require forging examples if not outright standards to facilitate the easiest and best range of uses, from the current frontier model to students and scholars who wish to explore and tinker.”

Leppert spoke with Harvard Law Today to discuss IDI’s mission and explain why the data stewarded by institutions like Harvard is the key to building a better AI future.


Harvard Law Today: What is the Institutional Data Initiative?

Greg Leppert: Our work at the Institutional Data Initiative is focused on finding ways to improve the accessibility of institutional data for all uses, artificial intelligence among them. Harvard Law School Library is a tremendous repository of public domain books, briefs, research papers, and so on. Regardless of how this information was initially memorialized — hardcover, softcover, parchment, etc. — a considerable amount has been converted into digital form. At the IDI, we are working to ensure these large data sets of public domain works, like the ones from the Law School library that comprise the Caselaw Access Project, are made open and accessible, especially for AI training. Harvard is not alone in terms of the scale and quality of its data; similar sets exist throughout our academic institutions and public libraries. AI systems are only as diverse as the data on which they’re trained, and these public domain data sets ought to be part of a healthy diet for future AI training.

HLT: What problem is the Institutional Data Initiative working to solve?

Leppert: As it stands, the data being used to train AI is often limited in terms of scale, scope, quality, and integrity. Various groups and perspectives are massively underrepresented in the data currently being used to train AI. As things stand, outliers will not be served by AI as well as they should be, and otherwise could be, by the inclusion of that underrepresented data. The country of Iceland, for example, undertook a national, government-led effort to make materials from their national libraries available for AI applications. That is because they were seriously concerned the Icelandic language and culture would not be represented in AI models. We are also working towards reaffirming Harvard, and other institutions, as the stewards of their collections. The proliferation of training sets based on public domain materials has been encouraging to see, but it’s important that this doesn’t leave the material vulnerable to critical omissions or alterations. For centuries, knowledge institutions have served as stewards of information for the purpose of promoting the public good and furthering the representation of diverse ideas, cultural groups, and ways of seeing the world. So, we believe these institutions are the exact kind of sources for AI training data if we want to optimize its ability to serve humanity. As it stands today, there is significant room for improvement.

HLT: How did Harvard’s data sets come into existence and what kind of materials are involved?

Leppert: The Caselaw Access Project was a multi-year effort at the Library Innovation Lab, starting in 2015. Over the course of about three years, 360 years of U.S. case law was scanned, parsed, and structured into a first-of-its-kind dataset. That dataset is now the backbone of legal AI training sets. We’re now working to release roughly one million public domain books, scanned at Harvard Library during the Google Books project. Two decades ago, Harvard Library became an early participant in that project and immense effort went into not only the scanning of the books but also their selection. The fundamental goal of the project was to increase the accessibility of this information and make these works “first-class citizens” on the internet, where the books themselves would become key reference resources. Part of IDI’s mission is, in a sense, to continue in that spirit by making that information accessible via new means, in addition to Harvard Library making them available to the Harvard research community.

HLT: Can you take me through the inception of the Institutional Data Initiative?

Leppert: The IDI concept began at the Harvard Law School Library’s Library Innovation Lab. I was interested in finding ways the academic researchers around me could have an impact on the trajectory of AI. I saw a lot of researchers going to industry to work on state-of-the-art models. I saw the technological resources needed to create those models becoming increasingly expensive. But I also saw the sheer magnitude of data within academia and other knowledge institutions. I became interested in finding ways to leverage institutional data resources to ensure there would be academic involvement in the building of AI. I brought that idea to Jonathan [Zittrain] and, thankfully, he was very supportive. Amanda Watson, the associate dean of the Harvard Law School Library, as well. And of course, Jack Cushman, the director of the Library Innovation Lab, created the time and space in which it could be incubated.

HLT: What obstacles exist that could potentially prevent IDI from achieving its goals?

Leppert: While university libraries and other knowledge institutions are well-positioned to inform AI and shape its impact, resource scarcity and time constraints are significant practical concerns. The rapid rise of any technology also tends to outpace the availability of technical expertise. At the same time, there’s incentive from the builders of AI to want to engage with the data that those institutions have, and so the IDI is meant to support those institutions to help them engage. IDI is working to develop a team of data scientists and community builders who can work with knowledge institutions and demonstrate how they can make their collections available for AI and for training. By helping other institutions identify the most effective and efficient ways to further their missions, we can help mitigate the inevitable challenge of limited resources. There’s still so much for everyone to learn regarding the future of AI, so part of our mission is to establish a robust forum for those critical conversations to occur.

HLT: Is the IDI engaging other knowledge institutions to explore opportunities for collaboration?

Leppert: Absolutely. We are currently working with Boston Public Library and are in talks with several others. With our launch, we’re hoping to build connections with as many knowledge institutions as we can. We’re data scientists who are ready and willing to help refine the data, prepare it for release, and post it on the servers. We can help strategize and advise other institutions on access mechanism options. We are ready and willing to do considerable leg work and simply need institutions that are interested in participating to reach out to us. We’re willing to do the rest.

We are also planning a spring symposium to bring together these institutions and begin the conversation about how we can work together. It’s meant to be as broad as possible, empowering others to release their data to the world. We’re trying to enable community practices to evolve among the institutions and for those to be informed by each of their missions and their goals. The momentum of AI is extremely powerful and, utilized correctly, can really amplify the missions of knowledge institutions across the world.

HLT: How do AI companies currently benefit from public work? How should the public be benefitting from the work of AI companies?

Leppert: The entire AI community benefits immensely from historical investments into public knowledge institutions because that data provides much of the foundation for AI models. Without public work, we simply would not have the same level of high-quality information needed to fuel the advanced models we see today. We have an opportunity to use those public investments — some of which were made centuries ago — to ensure AI benefits as wide a reach of humanity as possible. It’s a great time to have invested in knowledge stewardship, and it’s a great time to reinvest in it as we head into an AI future.

