Ask ChatGPT — an artificial intelligence-powered chatbot — what to do in Nashville, and you’ll receive a recommendation to visit the Country Music Hall of Fame or catch a show at the Grand Ole Opry. The chatbot, which was created by OpenAI, can also tell you details about a predatory lending scandal involving New York’s taxi industry, or what a food critic thought of a Guy Fieri restaurant.

But how does the chatbot know so much about, well, so much?

According to a copyright infringement lawsuit filed by The New York Times in December against OpenAI and its partner Microsoft, it’s because OpenAI scraped millions of the media juggernaut’s articles — along with works by other content creators — from the web to create the knowledge base that drives ChatGPT.

In its suit, the Times alleges that, when prompted by users, ChatGPT sometimes spits out portions of its articles verbatim, or shares key parts of its content, such as findings uncovered through investigations by Times reporters, or product endorsements carefully researched and vetted by Wirecutter, an affiliate site. ChatGPT has also “hallucinated” — or made up — articles attributed to the Times, according to the filing. All this violates copyright law and undercuts the Times’s business model, which relies on licensing, subscriptions, and ad revenue, say its attorneys.

An initial response by OpenAI posted to its website called the lawsuit a “surprise and a disappointment,” and contended that it was “without merit.” In a motion to dismiss four of the newspaper’s claims in February, OpenAI continued to fight back, touting ChatGPT as “revolutionary,” but adding that the chatbot “is not in any way a substitute for a subscription to The New York Times.” In fact, the memorandum alleges, “the Times paid someone to hack OpenAI’s products,” and even so, it “took them tens of thousands of attempts to generate the highly anomalous results” that make up the basis of the lawsuit.

In an interview with Harvard Law Today, Mason Kortz ’14, a clinical instructor at the Harvard Law School Cyberlaw Clinic at the Berkman Klein Center for Internet & Society, said the suit will be the first big test for AI in the copyright law space. When Kortz and his colleagues began talking about this issue around 2017, he adds, “People said, ‘this is sci-fi-type stuff. We don’t have to worry about that.’” Now, with the attention of business, the media, and the public at large, the lawsuit could have a significant impact on the development of a fast-growing number of AI systems in the U.S. “I’m interested to see where all this goes,” says Kortz.


Harvard Law Today: What are The New York Times’s legal arguments in this case?

Mason Kortz: They have several copyright claims here, and I think you can essentially break them down into three groups. One is that the actual training was infringing, in the sense that when OpenAI scraped all this data from the web, it had to make copies. It created its own libraries that included New York Times material, which is literal copying from The New York Times’s servers to its own. That material used for training was not licensed, and the copying therefore violates the Times’s right of reproduction.

Second is that the large language model that results at the end of the training is itself either a copy or possibly a derivative work of the Times’s body of copyrighted works. Because remember that the model is basically a very large set of statistics — something like 1.76 trillion parameters.

The third argument concerns outputs: someone uses ChatGPT and says, “tell me what The New York Times said about the first presidential debate,” or whatever, and it spits out text that is verbatim or close to verbatim from a New York Times article. This is what the complaint refers to as “memorization,” and it is a separate copyright violation, because now you are reproducing, exactly or with little variation, copyrighted expression owned by The New York Times.

HLT: What do you think of the Times’s various arguments?

Kortz: For the training, it is pretty clear that they created copies. OpenAI and Microsoft are likely going to say, yes, there was literal copying, but it was not infringement because it qualifies as fair use.

I think there’s more play with the other two. To say that the model itself is some sort of derivative work that constitutes a separate copyright violation — that is a pretty novel theory. There’s been a lot of discussion over the last five or six years about how you classify a statistical model as a “work” for copyright purposes. Some people would say it’s just not in the realm of copyright at all, because it is a set of facts, not expressions. But there’s another theory that you could call it a derivative work, because it is in a literal sense derived from copyrighted works. This is very untested.

The outputs claim is somewhere in the middle — I think it is less novel than the trained model claim. There could be some arguments that a statistical model that spits out memorized information might infringe the Times’ copyright. Here, it might come down to whether that’s fair use on the part of OpenAI or not.

“There’s been a lot of discussion over the last five or six years about how you classify a statistical model as a ‘work’ for copyright purposes. … There’s a theory that you could call it a derivative work, because it is in a literal sense derived from copyrighted works. This is very untested.”

HLT: What is the test for copyright infringement?

Kortz: The test looks at the similarity between the original work and the allegedly infringing work. It’s often hard to establish actual copying. If I paint a painting, and it looks like someone else’s copyrighted painting, I can say, “oh, that’s just a coincidence that we happened to end up with similar-looking paintings. I never saw the original.” A test of “substantial similarity” is often used to get around that question of whether there was copying.

In this case, it isn’t necessary to look at that, because OpenAI does not seem to be disputing that New York Times articles are in its training data. It has been relatively transparent about the corpus on which its large language models are trained. But “substantial similarity” can also be important for determining whether the expressive elements of a copyrighted work were copied, or just the ideas embodied in it, because copyright protects expression but doesn’t protect ideas. So, you can reimplement someone’s ideas in your own language, and that’s not a copyright violation. But if you take the other person’s language — their expression — that is a copyright violation.

HLT: How did OpenAI and Microsoft respond in their motions to dismiss?

Kortz: OpenAI and Microsoft both filed motions to dismiss some of The New York Times’s seven claims. With a motion to dismiss, you’re basically saying that, even if all the alleged facts are correct, the Times still hasn’t stated a valid claim under the law.

First, OpenAI argues that the Times cannot sue for infringement that happened more than three years ago, under copyright law’s three-year statute of limitations. Second, both companies argue that the Times hasn’t made out a contributory infringement claim, because OpenAI did not have actual knowledge of the infringing activities of its users; in fact, OpenAI says, such usage is explicitly barred by its terms of service.

The third claim involves the Digital Millennium Copyright Act, which was passed by Congress in 1998. A provision in the law encourages copyright holders to add copyright management information, or CMI, to digital assets — this is information that helps identify the creator or rightsholder, for example — and prohibits the removal of such information by others. The Times alleged that OpenAI violated the DMCA by removing that information when it scraped Times articles for its database, but OpenAI responds that, where removal did occur, it happened as part of an automated process. It also argues that, with respect to ChatGPT’s outputs, at most only an excerpt from a Times article is reproduced, and that such excerpts do not require the inclusion of CMI.

Finally, both companies seek to dismiss the claims made under New York state law, which they say are preempted by the federal Copyright Act.

HLT: What about the three claims that OpenAI did not seek to dismiss?

Kortz: OpenAI indicates that it thinks it can eventually prevail on those claims as well, and for many of them, it will likely rely on a defense called fair use. Here’s how it works: Initially, the burden is on The New York Times to show that there was copying. After that, the Times needs to show that the copied elements included protected expression.

Assuming the Times can prevail on those elements, then the burden is on OpenAI and Microsoft to raise some sort of affirmative defense. With fair use, you say, yes, there was infringement, but we’re not liable for it. Fair use is a statutory defense within the Copyright Act. The statute gives some examples of things that are typically fair use, including education, parody, satire, and criticism, but that’s not an exhaustive list. Fair use is determined through a four-factor test. The first factor is the purpose and character of the use: whether the allegedly infringing work is transformative or merely duplicative of the original. The second factor is the nature of the original work; for works that are highly creative and contain a lot of expression, it’s harder to claim fair use. The third factor is the amount used. Did the alleged infringer use the entire protected work or only a small segment of it? And the fourth factor is the effect the allegedly infringing use has on the market for the original. Is this going to displace the market for the original, or is it entering a new market that the original copyright owner is not able to enter?

The New York Times’s argument here is that if OpenAI hadn’t scraped all this data, it would have had to license the articles, the Times would have made X number of dollars, and therefore there is a market harm. Avoiding a licensing fee is considered a market harm. But, for example, if I am criticizing a movie, and I use clips from the movie in my criticism, and people then don’t see the movie because they think it’s bad, that’s not a legitimate market harm, because I’m not displacing the market. So, there’s a lot of nuance about what is and isn’t a market harm. Generally speaking, most courts would say that the fourth factor is the most important.

“Assuming The New York Times can show there was a copyright violation, then the burden is on OpenAI and Microsoft to raise some sort of affirmative defense. With [the fair use defense], you say, yes, there was infringement, but we’re not liable for it.”

HLT: Are there any other arguments that OpenAI is likely to use in its defense in later stages of the suit?

Kortz: Yes. On the idea that the model itself is a derivative work, if I were OpenAI’s lawyers, I would challenge that, saying that the model is not a copy or derivative work for the purposes of the Copyright Act; instead, it is a set of factual observations about the data. For example, you could go to a library and write down the titles of all the books that are there, and that would not constitute a derivative work of those books.

For that first bucket about the training, OpenAI is probably going to be almost entirely invested in fair use. For the output claim, there might be some argument that the reconstructions are not infringement, but I think that’s going to be tough, so I think they’re likely to have to fall back to fair use on the third bucket as well.

Another question this might come down to is, who is responsible? OpenAI clearly harvested all this data, so on the training claims, I don’t think it has anyone to push this off on. But you could argue that, on the outputs, if ChatGPT spits out something that contains a memorized chunk of a New York Times article, that is a misuse of the tool by the person writing the prompt. It’s like using Adobe Photoshop to create a copy of someone else’s work: Adobe isn’t the one doing the infringing, the user is. It’s maybe not the strongest argument, but it’s one that I think would not necessarily get you laughed out of court.

HLT: Do you think this case is likely to go to trial? Would the litigants want that?

Kortz: Statistically speaking, something like 98% of civil claims settle. If this one did go to trial, it would likely be an incredibly expensive suit for both sides. Because of that, and because of the uncertainty of some of the legal claims — you put 10 intellectual property lawyers in a room and you will get 11 different opinions on this — both sides are facing uncertainty. And if they are risk averse when it comes to the legal system, as most entities are, there’s probably a relatively wide band of settlement.

Generally speaking, settlement is better for litigants, unless you’re not just looking to recover damages, but also to make good law — that’s the idea behind strategic or impact litigation. I don’t think The New York Times would be averse to getting a good decision in favor of publishers on this. But I don’t think that’s why they’re engaging in this litigation.

“Because of the uncertainty of some of the legal claims — you put 10 intellectual property lawyers in a room and you will get 11 different opinions — there’s probably a relatively wide band of settlement.”

HLT: Let’s say this suit goes to court, and OpenAI is found to have violated copyright. What happens next?

Kortz: One possible type of relief is injunctive relief, where a court orders a defendant to stop doing the unlawful thing that it’s doing. And The New York Times does ask for that — permanently enjoining the defendants from unlawful, unfair, and infringing conduct. That would mean, at the very least, not allowing any queries to ChatGPT that reproduce New York Times material. At worst for OpenAI, it could mean that any large language model that includes New York Times content, or was trained on New York Times content, needs to be deleted and basically rebuilt from scratch without The New York Times stuff in there. But that would be impractical and could trigger other publishers like the Washington Post or LA Times to do the same thing, which would be an existential crisis for ChatGPT.

The other result could be damages. The New York Times alleges that it generally licenses articles at $10 each, and there’s an allegation in the complaint that the data set includes at least 16 million unique records of content from across the Times. If that were true, and they calculated the damages based on $10 each, that would be $160 million. If the infringement was willful, which means that the infringing party knew this was infringement and went ahead with it anyway, you could get increased statutory damages, and statutory damages can be up to $150,000 per copyrighted work. This could add up to astronomical, almost absurd, damages. There could be a crushing monetary liability here, such that even an entity like Microsoft could not continue to operate.
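To make those numbers concrete, here is a minimal back-of-the-envelope sketch in Python. It uses only the figures Kortz cites (the $10 per-article licensing rate, the 16 million records alleged in the complaint, and the $150,000 statutory cap for willful infringement) and assumes, purely for illustration, that every record would count as a separately compensable work; none of these are court findings.

# Rough damages arithmetic based on figures cited in the interview.
# Assumption (not a court finding): each of the ~16 million records counts
# as a separately licensable, separately compensable work.
records = 16_000_000        # unique Times records alleged in the complaint
license_fee = 10            # dollars per article, the licensing rate the Times cites
statutory_cap = 150_000     # maximum statutory damages per work for willful infringement

lost_licensing = records * license_fee      # 160,000,000 -> the $160 million figure
willful_ceiling = records * statutory_cap   # 2,400,000,000,000 -> a $2.4 trillion ceiling

print(f"Lost licensing estimate: ${lost_licensing:,}")
print(f"Willful statutory ceiling: ${willful_ceiling:,}")

That upper bound is the "astronomical, almost absurd" figure Kortz alludes to; actual statutory awards are set by courts within a wide range and would almost certainly be far lower.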

HLT: Where does this lawsuit fit within the context of litigation against AI systems more generally?

Kortz: There were a lot of suits filed about a year ago, and a lot of the initial ones focused on image generation tools: there were suits against Stability AI, which created Stable Diffusion, and against the company that developed Midjourney. With these tools, you can say, “I want a picture of a penguin sitting on a green chair,” and they will produce that image. Visual artists sued, and Getty Images sued. Then we started seeing individual authors, or groups of authors, suing AI developers, saying that their works had been ingested into large language models.

“At worst for OpenAI [if they were found to have violated copyright], it could mean that any large language model that includes New York Times content, or was trained on New York Times content, needs to be deleted and basically rebuilt from scratch without The New York Times stuff in there. But that would be impractical and could trigger other publishers like the Washington Post or LA Times to do the same thing, which would be an existential crisis for ChatGPT.”

The New York Times lawsuit is interesting because they seem to have more evidence of the direct memorization problem, where the system spits out things verbatim. Also, compared to some of the individual litigants, The New York Times and Getty Images have the advantage of deep pockets and in-house legal teams.

So far, a few of these cases have at least partially survived motions to dismiss, which just means the court has found that there is some plausible legal basis there. It’s definitely not a verdict, and we have not reached any sort of judicial holdings, let alone decisions in appeals courts that could be reliable precedent for years to come. We’re still very much in the formative stages of this area of law.

HLT: How should we weigh the need to compensate content creators for their work while also encouraging the development of new technology?

Kortz: These are all really fair concerns. In theory, OpenAI only needs to copy each New York Times article once, and it can essentially reuse them forever. That seems a little unfair. During the Writers Guild of America strike, one of the things the writers opposed was companies saying that if they owned the rights to a screenplay, they could use it to train AI to create future screenplays. There is the same concern for voice actors: that rather than pay someone to do voiceovers for commercials or cartoons or whatever, a company will have them come in and, for minimal pay, record 10 hours of saying random things, so that it can use their voice in perpetuity with AI tools. The displacement concern is very real.

But normative questions about who’s in the right and who’s in the wrong here get very complicated, because the companies concerned about AI aren’t always properly compensating their own content creators. As an example, look at record labels that are claiming that AI will destroy music and harm artists, but are only paying their artists five cents on the dollar.

Overall, I agree that there’s a high potential for individual writers, artists, and actors to suffer here. But I don’t necessarily think that the big companies that for the moment are aligning against the AI developers really have the little guys’ interests in mind all the time.

