We’re excited to welcome back Jason Kunze, our colleague from the Chicago office, for his second appearance on A Little Privacy, Please! Jason is an intellectual property and technology attorney, and the last time he joined us was to unpack the legal complexities of web scraping. Today, he’s back to help us make sense of the recent proposed settlement in Bartz v. Anthropic—a potential landmark moment in the intersection of AI and copyright law. This case raises questions about how companies source and repurpose massive datasets when training AI engines.
What’s the background on the Anthropic copyright dispute?
The Anthropic case is really exciting because it’s one of the first significant decisions we have on the AI and machine learning front. Companies like Microsoft, Meta, and Anthropic are pulling in huge training datasets to build out their tools.
Anthropic’s tool is called Claude. They sourced data in two ways.
First, they pulled from online pirate datasets—one of them is called “LibGen,” and there are a couple of others—with around 7 million books. They used at least some of that data to train Claude.
Second, they purchased large quantities of books from publishers and resellers, which they then scanned into their system. These were legitimate purchases of used books, made under written agreements.
Why does it matter that Anthropic used both purchased and pirated books?
For the purchased books, the court issued an opinion saying the use was permissible because it was highly transformative. Under copyright law, fair use is a multi-factor test and very case-specific. In this situation, the court said the use for training Claude was highly transformative—taking the information and applying it in new ways. So scanning and using those purchased books for training was fine.
But the court also addressed the roughly 7 million pirated books. Anthropic didn’t use all of them for training, but they retained the dataset anyway. The court took real issue with that. It rejected Anthropic’s fair use defense for the pirated dataset, and that ruling led to the proposed settlement, which is still pending. The court’s initial response was that the proposed settlement needs more work. We may be about a month away from knowing if it goes through.
Is $1.5 billion a typical settlement amount for copyright infringement?
It could have been a lot bigger. Anthropic was potentially going to trial. It’s important to distinguish between the full library of pirated works—about 7 million—and the 500,000 registered works the parties are addressing in the settlement agreement. Registered works are eligible for statutory damages, which range from $750 to $30,000 per work, or much more if there’s willful infringement.
The math works out to about $3,000 per registered work—$1.5 billion divided among 500,000 works. That’s toward the lower end of the statutory range, but the plaintiffs faced risks going to trial, too. This is a new area of law. Other court opinions could go in different directions. And Anthropic had its own defenses.
So is $3,000 a good deal? Well, Anthropic thinks so—they’re willing to sign off. That tells you something about the value of the dataset and the difficulty of licensing individually for all these authors. The plaintiffs’ class arguably did Anthropic a favor by pooling authors together for a large licensing agreement.
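As a rough check on the numbers discussed above, here is a back-of-envelope sketch using only the figures mentioned in this conversation (the settlement total, the count of registered works, and the statutory damages range); the exposure figures are hypothetical illustrations, not amounts from the case:

```python
# Back-of-envelope math on the proposed Bartz v. Anthropic settlement,
# using figures discussed above. Illustrative only.

SETTLEMENT_TOTAL = 1_500_000_000   # proposed settlement, in dollars
REGISTERED_WORKS = 500_000         # registered works covered by the deal

# Statutory damages range per registered work (17 U.S.C. § 504(c)),
# before any enhancement for willful infringement.
STATUTORY_MIN = 750
STATUTORY_MAX = 30_000

per_work = SETTLEMENT_TOTAL / REGISTERED_WORKS
print(f"Per-work payout:   ${per_work:,.0f}")  # $3,000

# Hypothetical exposure if every registered work drew statutory damages:
floor_exposure = REGISTERED_WORKS * STATUTORY_MIN
ceiling_exposure = REGISTERED_WORKS * STATUTORY_MAX
print(f"Floor exposure:    ${floor_exposure:,}")    # $375,000,000
print(f"Ceiling exposure:  ${ceiling_exposure:,}")  # $15,000,000,000
```

The spread between the floor and ceiling figures illustrates why both sides had reason to settle: the $3,000 per-work number sits well below the theoretical maximum exposure but comfortably above the statutory minimum.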
Where did Anthropic get access to pirated books?
Common databases are floating around online that anyone can access. Think back 15 years to when people were pirating movies—peer-to-peer networks were huge. You could download a new movie online two days after its release. People built up massive libraries. Meta also leveraged LibGen and other datasets to train its large language model, as discussed in a court decision issued two days after the Anthropic ruling (see Kadrey v. Meta Platforms Inc., No. 23-cv-03417 (N.D. Cal. June 25, 2025)).
These libraries are available, but everyone knows they’re not legitimately licensed. There’s risk. Companies assess the value of the training data and decide it’s worth it. They go ahead and use it, figuring they’ll sort it out later. Now we’re seeing what the price tag looks like.
Can AI companies legally license all the data they need for training?
It’s hard, even if they want to pay. How do you contact all these individuals? There are many works where the authors aren’t easily identifiable or reachable. I have no doubt Anthropic would like to license those works formally, but they may not be able to reach the authors.
It’s very hard for companies to find a way to license at scale. In other industries, like music, you have blanket license agreements. If I run a bar and play music, I pay based on ASCAP or BMI rates. There’s a rate-setting court. But we don’t have that infrastructure for AI. I’m not even sure blanket licensing would work here.
In the absence of that, there aren’t many options. And again, the value of training data is so high that maybe $3,000 per work is a reasonable price.
What impact will this settlement have on future AI copyright cases?
Because this is the first settlement of its kind, it’s going to be an anchor for future negotiations. If that $3,000 per registered work number holds, then when another company faces similar claims in the coming months, someone’s going to throw that number out as a starting point.