This seems… completely unsurprising? I'm actually shocked it wasn't 99%, but it is only Llama.
The paper (as reported) looked at Llama 3.1 70B. It is possible that the larger 405B-parameter version would show an even higher probability than the 70B version.
The probability threshold being written about here is 0.5, which is insanely high. But according to the paper, even at a threshold of 0.01 (still really high for LLMs, as discussed in the paper) the figure rises above 90%. It seems safe to say that the entire book is in there; the article is probably just being judicious to avoid being misunderstood about what that >90% number means.
Seems like there's a big difference between "can correctly continue a 50 token quote more than half the time for 42 percent of the book" and "can generate 42 percent of the book". If you didn't have the text of Harry Potter handy, there's no way you could get out a quote longer than a couple of pages, except maybe some sections that are quoted online in tons of places. And if you did get it to generate a chunk, you wouldn't know if it was right. A user who wants to read the book for free would be much better off googling "free online copy of Harry Potter."
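For anyone curious what "correctly continue a 50 token quote more than half the time" cashes out to, here's a rough sketch of the measurement as I understand it from the article. This is not the paper's code; it assumes a HuggingFace-style API, and the model name and file path are just placeholders. The probability of a continuation is simply the product of the model's per-token probabilities for the true next 50 tokens, given the 50-token prefix.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-70B"  # placeholder; any causal LM works for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def continuation_prob(prefix_ids, target_ids):
    """P(target | prefix): product of the model's per-token probabilities for the true continuation."""
    input_ids = torch.tensor([prefix_ids + target_ids])
    with torch.no_grad():
        logits = model(input_ids).logits[0]          # shape: (seq_len, vocab)
    log_probs = torch.log_softmax(logits, dim=-1)
    total = 0.0
    for i, token_id in enumerate(target_ids):
        # logits at position p predict the token at position p + 1
        total += log_probs[len(prefix_ids) + i - 1, token_id].item()
    return math.exp(total)

# hypothetical local copy of the book, purely for illustration
book_ids = tok(open("philosophers_stone.txt").read())["input_ids"]
p = continuation_prob(book_ids[:50], book_ids[50:100])
print(p)  # counted as "memorized" if p > 0.5; the looser >90% figure uses p > 0.01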
There are two different questions here.
One is "Is Llama 3.1 a practical way to get a copy of Harry Potter." Obviously the answer is no.
The other is "Is Llama 3.1 a derivative work of Harry Potter for copyright purposes?" I think the 42 percent figure is strong evidence that the answer is "yes." And this may mean that Meta needs to get a license from J.K. Rowling before it can publish the model—or maybe even use it internally. That is a big deal even if nobody is going to actually try to extract the full text of the book from the model.
I came here to say what Erick said. I understand your point, however.
What is the definition of a derivative work here? For instance, if I created a detailed statistical analysis of the colors used in Picasso paintings, is that a derivative work? Does copyright law consider the medium of the derivative work?
Does it matter how many input tokens it takes to get a specific output? Which is also a core question of NYT v. OpenAI, right? Is it infringement if a model will only cough up long portions of a work if prompted exactly right, or does it only matter that it is possible at all?
In the case of the paper, it seems like it should matter. To vastly oversimplify: f(n) = output, where n is some tokenized input from Harry Potter et al. f(n) literally cannot produce anything without specifying n. Is it really “memorized” if it takes substantial input to extract the output?
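To make the f(n) point concrete, here's a toy sketch of what "extraction" looks like under that framing. Again, this is not the paper's procedure; the model name and the local file are hypothetical placeholders. The only way to elicit, or even check, a verbatim continuation is to feed the model a chunk of the book itself as n.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-70B"  # placeholder
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# n: a 50-token excerpt taken from the book itself (hypothetical local copy)
book_ids = tok(open("philosophers_stone.txt").read())["input_ids"]
prefix, truth = book_ids[:50], book_ids[50:100]

# f(n): greedy decoding of the next 50 tokens given that excerpt
out = model.generate(torch.tensor([prefix]), max_new_tokens=50, do_sample=False)
guess = out[0, len(prefix):].tolist()

print("verbatim continuation:", guess == truth)  # without n from the book, there is nothing to prompt with or compare against
```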
If you want the full details I'd encourage you to read this paper by Cooper and Grimmelmann, which lays out the legal analysis.
https://arxiv.org/abs/2404.12590
But briefly, the Copyright Act defines copies as "material objects in which a work is fixed by any method now known or later developed, and from which the work can be perceived, reproduced, or otherwise communicated, either directly or with the aid of a machine or device."
So by this definition, Llama 3.1 is a copy of Harry Potter because it's fixed in a material object (a hard drive or SSD) and can be reproduced with the aid of a machine (a computer that does inference). The fact that it requires a very specific prompting technique to reproduce the work isn't important from a copyright perspective.
I’d be very concerned if this makes it harder to release open weight models. Having only proprietary models means that it’s harder to do research and it gives all the power to the companies that already have power.
The US Copyright Office has released a 100-page document ("Copyright and Artificial Intelligence, Part 3: Generative AI Training") explaining in far more detail why the mere training of most commercial AI models is already not fair use. Memorization has also been known about for years and is discussed in the document.
If I borrow a book from a friend, have I violated copyright? What if I have perfect memorization?
Jeopardy players seem to know a huge amount of esoteric info. Have they violated copyright by reading and remembering sources of info?
I just don't agree that an AI that accesses publicly available info, whether from TV broadcasts or even by obtaining a library card and checking out every single book available in a library, should be judged guilty of copyright violation.
This is just a ploy by authors and their agents to squeeze more money out of deep-pocketed AI companies.
Even... more? You know most authors have nothing, right? Art is not a path to riches.
But thankfully copyright law has already answered your question. Perfect memorisation is fine. But if you reproduce it from your memory, you're committing a crime. It's the reproduction, and redistribution, that is the problem.
Which is why "clean room implementations" are often done when making compatible software, for example. The programmer is not allowed to read the opposing code, because they will probably end up violating copyright if they do, from memory alone.
That settles it, if it comes down to how familiar the AI is with each piece of work individually. There's no good case.
US-based closed foundation models are now directly competing with Chinese models trained on far more permissive, aggressive, and unregulated data acquisition strategies — including the wholesale ingestion of copyrighted English-language content. The models released from these contexts aren’t just outperforming for technical reasons. They’re reflecting a fundamentally different paradigm: one that doesn’t pretend to care about the legalistic or ethical frameworks that US firms gesture toward. The results are clear — stunning fluency, emergent capabilities, and a tighter integration of global English-language media than any Western firm can currently match. The consequences are being felt not just in benchmarks, but in diplomatic backchannels and investor boardrooms.
The problem isn’t just that Western firms are losing ground. It’s that they can’t admit why. Because to admit it would be to admit that their own house isn’t in order either. The public relations gloss about fairness and authors’ rights is a smokescreen. These companies don’t care about protecting English-language creators. They care that someone else — a geopolitical rival — is better at stealing from them than they are from you. And because this is a market battle, not a moral one, the question becomes: whose theft wins? And who gets to decide which extraction is legitimate?
Meanwhile, the entire industry is under the cultural gravity of a system that rewards deception. Good people get paid to lie. And a growing share of the people building these systems aren’t even particularly good. They operate in sealed rooms, speaking in technical euphemism, selling abstraction to managers, and selling promises to governments. They are rewarded not for truth, but for traction. And traction comes from alignment with power.
All of this unfolds under the long shadow of great power conflict. Many of the largest models — both open and closed — are already entangled with military, surveillance, or intelligence networks. These affiliations aren’t hypothetical; they are funding streams, personnel pipelines, and quiet agreements. Nationalist values are being hard-coded into emergent systems under the guise of neutrality. And because these alignments are unacknowledged, they go unscrutinized. What you’re seeing is not just a technical race. It’s the unspoken convergence of empire and algorithm.
And none of the models recognized the wrong book title? The actual one is "Harry Potter and the Philosopher's Stone".
It's Sorcerer's Stone in the US, Philosopher's Stone elsewhere. One of the concessions Rowling made as a new author that she now regrets.
For Harry Potter there are "fanfiction stories" that are verbatim book text interspersed with commentary (often by other characters not present in the scene in question) written by the author. (Sort of the 'Let's Play' of the fanfic world).
If these got swept up into the training data, it would explain the presence of long passages in the final output.