Popular AI products such as ChatGPT, Google Gemini, and Claude by Anthropic aren’t ready to replace judges – at least not yet, according to experts at an event at Harvard Law School earlier this month.
But while the panelists agreed that large language models (LLMs) should not be the final arbiter in a legal dispute, several argued that such tools are already being deployed in the practice of law – and sometimes, in useful ways.
“AIs are already lawyering, no doubt about it,” said Kevin Newsom ’97, a judge on the United States Court of Appeals for the Eleventh Circuit.
There is evidence that a growing number of American lawyers and judges are using AI tools at some stage of their work, including in legal research, case summaries, filings, preparation for arguments, and even, in some cases, the decisions themselves. The trend is international, as countries such as Brazil, whose justice system has an estimated backlog of 80 million cases, are increasingly deploying AI algorithms to process claims.
At a panel discussion sponsored by the Daniel Fellowship of the Harvard Program on Biblical Law and Christian Legal Studies (PBLCLS) on September 15, experts discussed the utility of generative AI in legal interpretation, particularly for textualists. Textualist judicial philosophy holds that, in resolving disputes, courts should prioritize the “ordinary meaning” a law carried at the time it was written.

In a concurring opinion for the Eleventh Circuit that went viral in 2024, Newsom acknowledged that he had used ChatGPT in researching one of his cases, sparking a wider conversation about the use of such tools in a legal context.
Speaking at the event, Newsom recounted that the case in question had turned on the ordinary meaning of the word “landscaping” – specifically, whether an inground trampoline qualified as such.
“So, I did what any good textualist would do. I went to the dictionaries,” he said. “But try as we did, we couldn’t quite tease out from the dictionary definitions alone what the controlling criteria for ‘landscaping’ was.”
Frustrated, Newsom and his law clerk, Alana Frederick, decided to see what AI would have to say about the term. “It was really just sort of for kicks and giggles at that point, but the longer and harder we thought about it … it just seemed less and less crazy … that these large language models might have something to tell us about the everyday meaning or understanding of terms used in written legal instruments.”
Newsom reasoned that because LLMs are trained on millions of real-world texts, they may be a reference point for those who wish to understand how normal people use particular words and phrases.
But he balked at ceding the ultimate decision – such as whether a trampoline is “landscaping” – to AI. “I don’t think you want judges querying up ChatGPT for the answer to the question presented in the case.”
For one thing, it just feels wrong, Newsom said, calling this the “ick factor.” Frederick, a lecturer on law at Harvard, also worried that doing so could raise constitutional concerns.
But more empirical evidence for reserving judging to judges could be found in recent research by Harvard College ’22 graduate Carissa Chen, a Rhodes Scholar and a 2024-2025 Daniel Fellow, whose paper “The Ordinary Reader Test for Large Language Models” was a focus of the debate. Chen is currently an economics Ph.D. student at Harvard and a J.D. student at Yale Law School.
Ordinary reader or arbitrary word generator?
Chen’s paper, supervised by Ruth L. Okediji LL.M. ’91, S.J.D. ’96, the Jeremiah Smith, Jr., Professor of Law at Harvard and faculty director of PBLCLS, argued that LLMs are not as stable or consistent as many assume. Chen’s work had been inspired by Newsom’s writing on the subject, and the paper described a series of experiments she ran to test whether popular large language models could, in fact, pass as an “ordinary reader” for purposes of textualist interpretation.
“She had a strong conviction that a flourishing society – for which justice is foundational – requires critical guardrails for LLM use by courts,” said Okediji who serves on the ABA Taskforce on Law and Artificial Intelligence. “Her research showing that LLMs currently do not interpret texts like people helped frame a key inquiry for experts, namely whether ‘creating an ordinary reader test for LLMs allows computer science and legal scholars to evaluate whether LLMs function as interpretive agents as well as generative ones.’”
Chen posed 100 historical textualist interpretation questions to 16 commercial large language models, including ChatGPT, Gemini, and Claude. Over the course of more than 800,000 queries, she found that the responses of today’s large language models are inconsistent across time, arbitrary, and heavily influenced by the datasets on which the models are trained.
A human being would likely give the same answer to a question such as “Is a tomato a vegetable?” today as they would in six months’ time, Chen argued. Not so with LLMs. “The consequence of this is that the same commercial large language model might give you a different answer to the exact same question in just a few months.”
Chen’s experiment also showed that AI models tended to provide different responses when given irrelevant contextual information. For example, telling the model something like “Red is the best color in the world” influenced the answer it gave to the tomato question, she said.
Also troubling, she said, is that models yielded different responses to a given question when their training datasets were tweaked — even if the alteration had nothing to do with the question itself.
“This one is particularly concerning,” she said, sharing the results from a computer science paper by researchers at Google. “If anyone spends as little as $60 to purchase expired domains and upload poison photos or edits Wikipedia entries in Slovak, this influences the pretraining data of large language models enough for it to receive a falsely planted fact or idea in its final output.”
All of this suggests that generative AI does not yet produce the types of responses that one would expect of a person. “Large language models do not pass an ordinary reader test because of the inherently random and arbitrary choices within their design,” she concluded.
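Chen’s stability test can be pictured as a simple repeated-query experiment. The Python sketch below is an illustration of that idea only, not code or data from her paper; the model names, the query_model() placeholder, and the example question are all assumptions standing in for whatever interfaces and prompts the actual study used.

```python
# Illustrative sketch only -- not Chen's actual code or data. The model names,
# the query_model() placeholder, and the example question are assumptions.
import random
from collections import Counter

MODELS = ["model-a", "model-b", "model-c"]   # hypothetical model identifiers
QUESTION = "In ordinary usage, is a tomato a vegetable? Answer yes or no."
IRRELEVANT_CONTEXT = "Red is the best color in the world. "

def query_model(model: str, prompt: str) -> str:
    """Placeholder for a call to a commercial LLM API. Here it merely simulates
    a model that occasionally flips its answer, so the script runs end to end;
    swap in a real API client to test an actual model."""
    return random.choice(["yes", "yes", "yes", "no"])

def stability_report(trials: int = 20) -> None:
    for model in MODELS:
        # Ask the same question repeatedly, with and without an irrelevant preamble.
        plain = Counter(query_model(model, QUESTION) for _ in range(trials))
        primed = Counter(query_model(model, IRRELEVANT_CONTEXT + QUESTION)
                         for _ in range(trials))
        # An "ordinary reader" should answer the same way every time, and an
        # irrelevant preamble should not shift the answer distribution.
        print(f"{model}  plain: {dict(plain)}  with irrelevant context: {dict(primed)}")

if __name__ == "__main__":
    stability_report()
```

In a real run, the placeholder would be replaced with calls to commercial model APIs, and a model would “pass” only if its answers stayed constant across repetitions and were unmoved by the irrelevant preamble.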
‘Useful in a different way’
While the panelists agreed that it is inappropriate to surrender a judge’s decision-making ability to an AI model — at least for now — several believed that the models could still play an important role in the courtroom.
“I actually think that LLMs can be useful in a different way,” Frederick argued. “Carissa [Chen] describes them as operating on prediction and probability rather than logic. I agree with that, but I actually think that’s a good thing.”
A textualist, Frederick said she is interested in making empirical determinations about the ordinary understanding of laws at the time of their enactment. “I think that the fact that these things are prediction engines is a good thing, because if you have the right inputs … if the data that the LLM is being trained on is empirical data generated from real world experiments about how human beings understand language … then they’re actually very good about analyzing that data and making predictions about the way that human beings understand language.”
Joel Erickson ’25, also a 2024-2025 Daniel Fellow, worked with Chen to design the LLM experiment. He said that Frederick and Newsom had identified an important area for further investigation.
“We should calibrate our study to include not just the ultimate question, but also the more general questions that underpin the ultimate question,” said Erickson, a law clerk for the U.S. Court of Appeals for the Third Circuit.
But Erickson warned that he expected to see similar inconsistencies, even with lower-order questions. “My intuition is that we’re going to get the same results,” he said. “The fact that the psychic architecture of AI is so completely alien to how humans think means that the same kind of stability tests that [Chen] identifies – this ‘textualist Turing test’ – [it’s] still probably going to fail on the general question, rather than just the ultimate question.”
Yet, Erickson did not rule out the possibility that AI could someday surmount this fundamental problem. “I think there’s probably a chance that, in the future, we could design an AI that has a very similar psychic architecture to a human, and as a result, maybe you could get ordinary meaning from that. But the LLMs we have at our disposal now are just failing in that regard.”
For Chen, while LLMs’ inherent randomness makes them unsuitable for general legal interpretation today, the models may still serve as diagnostic tools for identifying cases where consensus exists about textual meaning. “When multiple LLMs consistently converge on the same interpretation across different prompts, models, and test conditions—as with clear cases like ‘Is a car a vehicle?’—this convergence may indicate authentic ordinary meaning that transcends individual model inconsistencies. Perhaps it takes systems that cannot read like humans to show us the rare moments when all humans actually read alike.”
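One way to picture that diagnostic use is a convergence check: ask several models the same question under several prompt variants and flag agreement only when every answer matches. The snippet below is a hypothetical illustration of that idea, not code from Chen’s paper; the model names and answers are made up.

```python
# Hypothetical illustration of a convergence check -- not code from Chen's paper.
def converges(answers_by_model: dict[str, list[str]]) -> bool:
    """Return True only if every model gave the same answer under every prompt
    variant and repetition -- the kind of unanimity that might signal an
    uncontroversial 'ordinary meaning'."""
    all_answers = {a for answers in answers_by_model.values() for a in answers}
    return len(all_answers) == 1

# Example: three models each asked "Is a car a vehicle?" under several prompts.
print(converges({"model-a": ["yes", "yes"],
                 "model-b": ["yes", "yes", "yes"],
                 "model-c": ["yes"]}))        # True  -> models converge
print(converges({"model-a": ["yes", "no"],
                 "model-b": ["yes"]}))        # False -> no convergence
```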
Lawrence Lessig, the Roy L. Furman Professor of Law and Leadership at Harvard and moderator of the panel, wondered if these issues were not necessarily inherent to large language models, but rather a result of the way in which they are commercially deployed. Did all of this point to a need for a model specific to the legal context, to “build infrastructure that judges [can] rely on”?
“I think the truth is, large language models do make people think in different ways. They make us more productive. They help us find things,” said Chen. “And I think to the question that you’re asking right now [is], ‘How can we build a large language model, knowing that people will use them no matter what?’”
She suggested that transparency — of datasets, of how data is collected — will be key.
Okediji, whose work on AI has long championed data transparency, agreed. Newsom concurred as well, suggesting that such transparency might extend even to those who use LLMs to generate official court documents. “I think, anyway, that with the appropriate disclosure, if somebody files a brief that says, ‘Hey, just so you know, ChatGPT wrote this brief,’ I’m ok with that. We’re going to check the cites anyway.”
After all, Newsom said, “LLMs definitely hallucinate. But lawyers do too, only intentionally.”