45 Comments

Thanks for the most definitive, balanced, and thorough article on the NYT vs. OpenAI case. I definitely learned a thing or two (but not 1,000 things, because at that scale, we're veering dangerously out of "fair use" territory).

Haha, luckily human learning really is outside the reach of copyright law because a copy needs to be fixed in a tangible medium and the human brain is not considered to be a tangible medium.

My brain's been called many a bad thing but "not a tangible medium" is perhaps the most devastating insult of them all. I might never recover.

Well played, good sir!

It’s a term of art, I guess, since brains are without any doubt tangible!

It is a term of art but also I think the distinction is more related to “medium” than “tangible.”

Thanks. It is all about drawing distinctions wisely. Great article.

Do you know what prompts were used to have ChatGPT create some of the examples you gave? I am curious how the request was crafted: whether the person writing the prompt was trying to get ChatGPT to generate the material to prove the point that copyrighted material is part of the machine learning, or was just being cheap and trying to get around the paywall. I am researching GenAI and prompt engineering for a class and wanted to be able to show how the response was generated. Thank you for an excellent article.

Yes, I did it myself. I used the OpenAI Playground with the temperature set to zero, provided the first part of the article as the prompt, and it produced the regurgitated sample as a response.

Thank you.

The MP3.com case was decided very quickly, but the Texaco (9 years) and Google (10 years) cases took years to resolve. At the speed OpenAI is developing, will anything in the case against (more or less) GPT-3 be relevant against GPT-13?

To me it seems simple. If it's publicly available, i.e. in a tweet, on an open website, etc., it's fair game for AI or anyone else. It has the same measure of privacy that would be expected on a bulletin board at a coffee shop or town square, i.e. none. To claim that because it's your writing, AI can't use it but anyone walking by can, makes zero sense. Now if it's taking advantage of private material behind a paywall, or not attributing sources, that's different. But I don't think that's what the lawsuit is about.

Well the Times has a metered paywall that limits readers to ~10 articles per month. OpenAI trained its model on a lot more than 10 articles. So does that count as "taking advantage of private material beyond a paywall?" Or is your view that as soon as they make it freely available to anyone they lose control over its use in training?

I think you're correct that the AI industry may have a problem with the "scalability" argument, but I disagree totally with that logic.

If I decide to read a bunch of free NYT articles and write a summary to explain my view, or even to simply condense the articles into one summation, it is clearly NOT violating copyright. Right? So how is it violating copyright if an AI does it with 10,000 articles?

It is my case that if it's freely accessible, that's it. And even if it was not: what if the AI's owners paid the normal subscription fee? Now can they use it for training?

I think what I'm getting at is that AI (and I admit I'm not a tech expert) is just another way to sort data. How can you apply specific rules to it that don't apply to people? It's like saying that if I sort data with a pen and paper I can do whatever I want with it, but if I use a software program I'm not allowed to. It seems arbitrary. According to this logic we'd have one set of rules for people with shovels and another for people with bulldozers.

I admit I really hope AI takes off. Maybe that's coloring this, but I think this makes sense.

Copyright law governs the making of copies, which must be "fixed in a tangible medium of expression." A piece of paper, a CD, and a computer hard drive are all tangible media. The human brain is not. So just at a baseline level, the law treats information processing by computers differently from human thought. When you read 100 books whatever happens in your mind is not governed by copyright law (and that's good obviously). If you scan 100 books into a computer, that might be governed by copyright law depending on the situation because you have made a copy as far as copyright law is concerned.

So I don't think the "how can there be rules that apply to computers but not people" argument really works. Maybe you think the law shouldn't draw such a distinction, but it does and the courts are obligated to apply the law as it's written.

Or invalidate the law. Or modify/extend it to accommodate modern technology.

But the AI isn't "copying" it, right? It's scanning it and then creating something new. Does the AI actually have to reproduce it, copy it, before it uses it? Assuming it scans it, then uses it without creating any copies, but instead creates a new product which is a mix of different articles and sources, is it not "copying"?

I think these questions misunderstand how computers work. Absolutely everything computers do involves making copies. We might colloquially say that a computer "scans a website," but what that actually means is that it downloads a copy of the website onto the disk (or into memory) and then does some analysis on the local copy. Moreover, the software that trains a model is not downloading copies of training data directly from the websites they came from. An organization like OpenAI is going to download billions of documents from various websites and put them all together in a big database, which is then pre-processed in various ways before the actual training run occurs.

Maybe at the end of the training run they delete the local copy so that no permanent copy is stored, but there's no scenario where a computer uses data without copying it. A computer can't do anything with data unless it first makes a copy of it in its local storage.
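To make the point above concrete, here's a minimal Python sketch. The `fetch` function and the URL are hypothetical stand-ins for a real HTTP download (simulated here with a hard-coded string), but the structure is the point: any program that "analyzes" a document first materializes a local copy of its bytes.

```python
def fetch(url: str) -> bytes:
    # Stand-in for an HTTP download. A real fetch would copy the remote
    # document's bytes into local memory (or onto disk) before anything
    # else could happen.
    remote_document = "All the news that's fit to print."
    return remote_document.encode("utf-8")

def analyze(url: str) -> int:
    local_copy = fetch(url)           # the copy is made here, unavoidably
    text = local_copy.decode("utf-8")
    return len(text.split())          # any analysis operates on the copy

print(analyze("https://example.com/article"))  # → 7
```

Even if `local_copy` is discarded the moment `analyze` returns, a copy existed for as long as the computation ran, which is the distinction copyright law cares about.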

I just hope they work something out where they pay a fee or something so everyone gets credit, makes money, and the tech doesn't get run into the ground. Thanks for taking the time to answer my questions.

The "bulletin" standard, by the way, comes from criminal law which basically says that publicly posted information on social media can be used in court. That's why I assumed the same thing would apply here. But you're right, if it copies data it makes sense that copyright would apply.

Interestingly, in the EU we have stricter laws around personal information. EU citizens have the right to know how their personally identifiable information is being used and to have it removed. It also can't be used without permission.

The AI can perfectly reproduce Mario, Batman, Spiderman, Black Widow, and long chunks of text from specific New York Times stories.

Just from a causality perspective, the model is clearly embedding the content of at least some of the training data inside it. The fact that you can't point to a specific chunk of the model and say "see, here's where they copied the New York Times" doesn't mean it's not a copy; it's just a copy with a very obfuscated encoding. Someone described these models as a "hologram of the internet," and I think that's a very good metaphor.

I still can't get past how you could make something freely, publicly available, and then get upset that it's used by a computer to generate answers to questions, art, etc. Even if it was paid content, if an AI producer paid to access it and then used your work to produce something else, it is no longer "your" article, art, etc. It has been used as a model, alongside thousands (maybe even millions, right?) of examples, to produce new content.

Part of the entire point of computers is to process large amounts of data much faster than could be done before. I don't understand why this particular leap upsets so many people, outside of the obvious: writers who do very basic, unimaginative work will probably lose their jobs. Maybe the same for artists. And even that is debatable; I think that this technology, as with each advance before it, will ultimately create more jobs, probably work we can't even imagine right now.

And I can see this leading to a model of various subscriptions to art/writing, where maybe AI creators have to pay to assimilate certain data. That makes sense too. But then it's not "free, public" information anyway, so they wouldn't have access to it for training without paying, right?

You don't lose copyright on a work by making it publicly available. It doesn't give anyone the right to incorporate it into their work beyond the bounds of fair use or whatever license to it you have provided.

When you look at the degree to which these models are clearly wholesale incorporating the training data, I don't really see a difference between this and someone copying source code out of an open source project and into their own closed source work.

If you took someone's exact source code, which was made publicly available for free, then changed it in various ways for the exact same purpose, only it did its task better or differently, would it violate copyright?

This clearly doesn't apply to cellphones, cars, or any other technology, including writing stories. If I do an "exact" rip-off of a plot except change the characters and background, it may be tacky, but no one would claim it violated copyright law. To the extent the changes are positive and useful to customers, or entertaining, they'll decide to use whatever alternative is being offered, or not.

Maybe I'd have to be walked through why this is different, if it is.

I read the NYT daily via RSS. There does not seem to be any kind of paywall for me. I read their articles and make more than 100 comments monthly.

This article changed my mind on the legal question. Great job!

It seems like you could solve this with some creative training. For example, when you save copyrighted training data, replace every space or double space after a sentence with 4 half-spaces. Then either train or hardcode the AI to refuse the most likely token when the last token in the prompt is 4 half-spaces.

By continuously kicking it off track, it should become difficult to reproduce near-exact copies. Also, not allowing the most likely token to be chosen should tend to make the output perform worse under any RLHF. That in turn should make the AI learn to avoid exact quotes of copyrighted material.

Indeed, it's likely that it will come to grasp that token strings containing 4 half-spaces are fundamentally different.

Now, there are likely problems with what I just suggested. It immediately comes to mind that quotations would be an exception. My point, though, is that this is just an off-the-cuff idea. It seems like a more serious investigation could solve this problem.
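For what it's worth, the decoding-time half of that idea can be sketched in a few lines of Python. Everything here is a toy stand-in (the `<hsp>` sentinel token, the greedy decoder, the hand-written logits); no real model tokenizes text this way, but it shows the mechanism: when the recent context ends in the sentinel, ban the most likely next token.

```python
# Four half-spaces, represented as four copies of a hypothetical sentinel token.
SENTINEL = ("<hsp>", "<hsp>", "<hsp>", "<hsp>")

def pick_next_token(logits: dict[str, float],
                    recent_tokens: tuple[str, ...]) -> str:
    """Greedy decoding, except the top token is banned after the sentinel."""
    ranked = sorted(logits, key=logits.get, reverse=True)
    if recent_tokens[-4:] == SENTINEL and len(ranked) > 1:
        return ranked[1]   # skip the most likely (possibly memorized) token
    return ranked[0]       # otherwise pick the most likely token as usual

logits = {"the": 2.0, "a": 1.5, "cat": 0.3}
# After the sentinel, the top choice ("the") is suppressed:
print(pick_next_token(logits, ("x", "<hsp>", "<hsp>", "<hsp>", "<hsp>")))  # → a
# In ordinary context, greedy decoding is unchanged:
print(pick_next_token(logits, ("x", "y")))                                 # → the
```

As the reply below notes, a model trained against an objective like this might simply learn never to emit the sentinel in the first place, which is the usual fate of decoding-time watermarks.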

Strongly suspect that the model will just learn to never produce 4 half-spaces.

It’s a good idea, but I think the fundamental problem is that you can’t *prove* it will never produce an exact copy.

In Google’s case, the technology prevented it. In OpenAI’s case, you can put up higher and higher guardrails, but you can never prove that a clever prompt won’t get the exact text back again.

I think that’s what’s being argued here.

This is a great, interesting article. Thank you. Funny that I read it today, because I used DALL-E last night to create an image of Pokemon playing pool at a bar. (To be a cover image used for a blog post.) I kept reprompting it to change the image to my liking. Some of the responses would list it as “…cute animated animals…” clearly trying to bypass using the word “Pokemon,” but some responses didn’t even bother disguising it. There were enough weird non-Pokemon animals featured, but play with it long enough, and it will spit out Pikachus.

A lot of the time it seems like IP law cases are won by the attorneys who can holler "screw you" the loudest.

As I've said, there is a root issue here: OpenAI et al. are arguing that AIs have, or should have, the same rights as people, that is, to read anything they please, and to recombine and repurpose that reading as they please, as long as they avoid producing close replicas of copyrighted works.

There is a big problem with this theory: AIs, unlike people, can be "owned" by a commercial entity, and as a result, must be considered to function as extensions of that commercial entity, rather than as independent agents.

If I had, for example, a personally-controlled AI - that is, one to which I and no commercial entity had access - I think I would be within my rights to read it any books from the library I pleased, or to show it any paintings at the museum I pleased, and to ask it to reproduce these in part or in whole, just as I could in theory memorize a written work or reproduce a painting for myself. I might even be within my rights to utilize a personal AI so trained for commercial purposes, or in the course of employment - it's not doing anything I couldn't do, with enough time, or wouldn't be allowed to do.

There are much greater restrictions on corporations & their agents than there are on persons, and I think these AI companies would be wise to steer clear of any arguments like theirs here that essentially rest on AI personhood...you're not allowed to own a person.

This is a clear and balanced article. I personally read way better with my ears, and found this worth running through an AI narrator for easy listening. Let me know if this isn't something you want to exist and you'd like me to get rid of it.

https://askwhocastsai.substack.com/p/why-the-new-york-times-might-win?sd=pf

Really nice article, and fun to bring up MP3.com. I originally thought you were going to talk about Aereo as well, but I had to look it up and that one wasn't about fair use for place shifting at all.

Overall, a clear and balanced response to what I wrote. Well written. Well thought out.

Glad you liked it!

We’re putting out a note tomorrow that comes out somewhat differently than your piece. I’d love your thoughts when it’s out.

Looking forward to it. Please email me!

Maybe not! I think the NYT and other entities suing under similar premises will ultimately lose. There is no difference between a person reading the NYT and then regurgitating what they read to friends and an AI "reading" the NYT and then providing answers to questions using that digested information.

------

OpenAI Seeks to Dismiss Parts of The New York Times’s Lawsuit

The artificial intelligence start-up argued that its online chatbot, ChatGPT, is not a substitute for a New York Times subscription.

By Cade Metz and Katie Robertson

Feb. 27, 2024

OpenAI filed a motion in federal court on Monday that seeks to dismiss some key elements of a lawsuit brought by The New York Times Company.

The Times sued OpenAI and its partner Microsoft on Dec. 27, accusing them of infringing on its copyrights by using millions of its articles to train A.I. technologies like the online chatbot ChatGPT. Chatbots now compete with the news outlet as a source of reliable information, the lawsuit said.

In the motion, filed in U.S. District Court for the Southern District of New York, the defendants argue that ChatGPT “is not in any way a substitute for a subscription to The New York Times.”

“In the real world, people do not use ChatGPT or any other OpenAI product for that purpose,” the filing said. “Nor could they. In the ordinary course, one cannot use ChatGPT to serve up Times articles at will.”

...

https://www.nytimes.com/2024/02/27/technology/openai-new-york-times-lawsuit.html

One thought I have: literally everything entered into these needs to be classified as PII and the heaviest of the Mahler’s 6th hammers must be swung at anyone who violates the privacy of a user.

No one should be surprised that the law says that when you expect to make a profit using someone else’s copyrighted work, you have to pay them. Students typically don’t make profits but schools do and schools have to pay too.
