Copyright lawsuits against LLMs found wanting
Plus the Chinese livestreamers who literally never sleep. Because they're AIs.
A month ago I wrote that I’m perfectly happy to let robots—well, LLMs—scan the content of books that I’ve written so that they can learn my wonderfully desirable writing style, to help them make sense when people ask them questions.
However, there are plenty of people (content producers, predominantly) who don’t see it that way. They, with the help of various groups of lawyers, have launched lawsuits against the companies that run LLMs and image generation systems.
The basis of these lawsuits tends to be the same: copyright infringement in the first place by vacuuming up all the content used to train the machine learning system, and then continued copyright infringement in the output of the systems.
There were some dramatic headlines when the lawsuits were first filed, in January. Suing Stability AI, Midjourney and DeviantArt, three artists alleged, in an attempted class action (ie potentially representing everyone in the US who was affected), that
these organizations have infringed the rights of “millions of artists” by training their AI tools on five billion images scraped from the web “without the consent of the original artists.”
The same law firm that filed that suit also filed a suit in November 2022 against Microsoft, OpenAI and GitHub (collectively) over Copilot, trained on code from the web.
And, to round it off, the comedian Sarah Silverman and two authors, Christopher Golden and Richard Kadrey, filed suit in July against Meta (parent of Facebook) and OpenAI on the basis that they
“did not consent to the use of their copyrighted books as training material for ChatGPT. Nonetheless, their copyrighted materials were ingested and used to train ChatGPT.”
Well, eventually lawsuits will reach judges if nothing else intrudes, and so in the past few weeks two of those suits have indeed come up before the bench.
And things are going rather as I expected they would; though of course with the proviso that you can never be absolutely certain how a judge will rule.
First up was the lawsuit against Stability AI, Midjourney and DeviantArt. At the end of October, US district judge William Orrick ruled that
copyright infringement claims cannot move forward against Midjourney and DeviantArt, concluding the accusations are “defective in numerous respects.” Among the issues are whether the AI systems they run on actually contain copies of copyrighted images that were used to create infringing works and if the artists can substantiate infringement in the absence of identical material created by the AI tools.
As the Hollywood Reporter notes, Orrick was pretty brutal in his dismissal of the claims. As two of the artists didn’t have anything registered with the Copyright Office, Orrick dismissed their claims at once.
For the rest,
DeviantArt vigorously disputes the assertions made throughout the Complaint that “embedded and stored compressed copies of the Training Images” are contained within Stable Diffusion. DeviantArt (and Stability and Midjourney) argue that those assertions are implausible given plaintiffs’ allegation that the training dataset was comprised of five billion images; five billion images could not possibly be compressed into an active program.
Which is very much the point. Those images aren’t inside the program, just as the books you’ve read aren’t “inside” your head, even if you can recite them from memory. Machine learning systems “learn” by adjusting the weights they give to various elements of their inputs until the outputs they produce earn “approval”; the training data itself isn’t stored. (Do the arithmetic: Stable Diffusion’s weights amount to a few gigabytes, which spread across five billion training images is less than a byte per image, nowhere near enough to hold even a thumbnail.) And the artists effectively admit that, saying in their complaint that “in general, none of the Stable Diffusion output images provided in response to a particular Text Prompt is likely to be a close match for any specific image in the training data.”
So you’re saying you can’t get your content out? Thanks for undermining your own case. Orrick does leave some room for the artists to revise their case, and try to improve their argument for a future refiling. But given that they undermined their own case so thoroughly, and couldn’t even manage to get a knife blade into the copyright angle (which is what it all rests on, really), one has to think that’s going to be a hiding to nothing.
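To make that weights-not-copies point concrete, here’s a minimal sketch in Python. To be clear, the model, the numbers and the training pattern are all toy inventions of mine, nothing to do with Stable Diffusion itself; the point is only to show that training squeezes many examples into a few learned weights, after which the examples can be thrown away.

```python
import random

# Toy "model": y = w * x + b. Just two numbers to learn.
w, b = 0.0, 0.0
learning_rate = 0.01

# Five thousand training examples drawn from a hidden pattern (y = 3x + 1, plus noise).
training_data = [(x, 3 * x + 1 + random.gauss(0, 0.1))
                 for x in (random.uniform(-1, 1) for _ in range(5000))]

# "Learning": nudge the weights whenever the model's output misses the mark.
for epoch in range(20):
    for x, y in training_data:
        error = (w * x + b) - y          # how far off the model's guess is
        w -= learning_rate * error * x   # adjust each weight to shrink the error
        b -= learning_rate * error

del training_data  # the examples are gone; only the weights survive

print(f"learned weights: w={w:.2f}, b={b:.2f}")  # roughly 3 and 1
```

Five thousand examples end up as two floating-point numbers. That, scaled up enormously, is the sense in which five billion images cannot be “embedded and stored” inside a program a few gigabytes in size.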
OK, so we now move on to the claim against Meta and OpenAI by Sarah Silverman et al. Oh dear. Once again, the claims don’t survive first contact with the enemy, aka jurisprudence. Judge Vince Chhabria on Monday handed down what the Hollywood Reporter called
“a full-throated denial of one of the authors’ core theories that Meta’s AI system is itself an infringing derivative work made possible only by information extracted from copyrighted material”.
In other words, he said that you can’t claim that just because a chatbot took in your words and now puts out words of its own, it is therefore infringing your copyright. Because it isn’t.
Chhabria’s ruling followed Meta’s request to dismiss all the claims, except for one alleging unauthorised copying of the authors’ books to train its LLaMA large language model. (That’s probably because Meta used the Books3 database.)
And so Chhabria did. Nor did he spare the blushes of the plaintiffs, calling their ideas “nonsensical”: the LLM outputs aren’t “derivative works” (defined as “a work based upon one or more preexisting works” in any “form in which a work may be recast, transformed or adapted”). Chhabria says:
“There is no way to understand the LLaMA models themselves as a recasting or adaptation of any of the plaintiffs’ books.”
He then basically repeats this point a few times, and explains how far the plaintiffs are from getting anywhere near having a case. Which is a long way.
The Hollywood Reporter sums up what the confluence of those two rulings implies:
This means that plaintiffs across most cases will have to present evidence of infringing works produced by AI tools that are identical to their copyrighted material. This potentially presents a major issue because they have conceded in some instances that none of the outputs are likely to be a close match to material used in the training data. Under copyright law, a test of substantial similarity is used to assess the degree of similarity to determine whether infringement has occurred.
My take is that these lawsuits are doooooooomed. Given that British law is broadly similar to US law in this regard (or vice-versa), my expectation is that any similar lawsuit filed in the UK will fail in the same way.
Not that this is putting people off; earlier this week another class action lawsuit was filed against OpenAI and Microsoft by authors of nonfiction and academic works. There seem to be about a dozen of these cases buzzing around; it would be great if they were all consolidated into one about images, one about text, and one about programming (though arguably that’s the same as the text one).
Yet there’s also a little wrinkle to all this: AI-generated content doesn’t have copyright. In the US,
It has long been the posture of the U.S. Copyright Office that there is no copyright protection for works created by non-humans, including machines. Therefore, the product of a generative AI model cannot be copyrighted.
That obviously includes products where humans have told the AI model what to produce via a prompt. The irony is that prompts are fiercely defended, but the outputs themselves can be shared with impunity. Unless—and this is important—there has been some human work done on the output content itself before it’s then shared.
And yet a collection of AI-generated art, perhaps with human-written text embellishing the pictures, does get copyright for the whole thing—but not for the individual bits of art. (I guess the same would apply for AI-generated poetry, or chunks of text.) The key part is the human making a choice: copyright is all about human involvement in the process.
But these lawsuits? They’re a waste of time. Their proliferation tells you more about the way that the American legal system allows lots of potential payoffs than about the merits of the cases. Content creators, and that obviously includes me, have to learn to live among these things. I can use them to produce the artwork that sits at the top of this article, and nobody’s losing out because of it: otherwise I’d just have used a CC-BY licensed picture. Whether it’s from humans or machines, we’re now drowning in a sea of content. The lawsuits won’t change that.
Glimpses of the AI tsunami
(Of the what? Read here. And then the update.)
• From toy to tool: DALL-E 3 is a wakeup call for visual artists—and the rest of us. No signs of the art-generating systems slowing down.
• “Make It Real” AI prototype wows devs by turning drawings into working software. You draw the idea and OpenAI’s system writes the code behind it—a Breakout clone, for example. No word on whether it’s code that you can actually maintain.
• AI fakes have become part of the information war in Israel v Hamas. Inevitably. And a more nuanced take in Wired.
• These look like prizewinning photos. They’re AI fakes. The Washington Post is here to help.
• Deepfakes of Chinese influencers are livestreaming 24/7. In its way, taking influencing to its logical conclusion: human you needs to sleep occasionally, but nobody said computer you has to.
• You can buy Social Warming in paperback, hardback or ebook via Oneworld Publications, or order it through your friendly local bookstore. Or listen to me read it on Audible.
You could also sign up for The Overspill, a daily list of links with short extracts and brief commentary on things I find interesting in tech, science, medicine, politics and any other topic that takes my fancy.
• Back next week! Or leave a comment here, or in the Substack chat, or Substack Notes, or write it in a letter and put it in a bottle so that The Police write a song about it after it falls through a wormhole and goes back in time.