Thanks for the most definitive, balanced, and thorough article on the NYT vs. OpenAI case. I definitely learned a thing or two (but not 1,000 things, because at that scale, we're veering dangerously out of "fair use" territory).
Do you know what prompts were used to have ChatGPT create some of the examples you gave? I am curious how the requests were crafted: was the person writing the prompt trying to get ChatGPT to generate the material to prove that copyrighted text was part of the training data, or just being cheap and trying to get around the paywall? I am researching GenAI and prompt engineering for a class and wanted to be able to show how the responses were generated. Thank you for an excellent article.
The MP3.com case was decided very quickly, but the Texaco (9) and Google (10) cases took years to resolve. At the speed OpenAI is developing, will anything in the case against (more or less) GPT-3 still be relevant against GPT-13?
To me it seems simple. If it's publicly available, e.g. in a tweet or on an open website, it's fair game for AI or anyone else. Such material has the same measure of privacy that would be expected on a bulletin board at a coffee shop or town square, i.e. none. To claim that because it's your writing, AI can't use it but anyone walking by can, makes zero sense. Now, if it's taking advantage of private material behind a paywall, or not attributing sources, that's different. But I don't think that's what the lawsuit is about.
It seems like you could solve this with some creative training. For example, when you save copyrighted training data, replace every space or double space after a sentence with four half-spaces. Then either train or hardcode the AI to refuse the most likely token whenever the last token in the prompt is four half-spaces.
By continuously kicking the model off track, this should make near-exact copies difficult to reproduce. Also, not allowing the most likely token to be chosen should tend to make the output perform worse under any RLHF, which in turn should push the AI to avoid exact quotes of copyrighted material.
Indeed, it's likely the model would come to treat token strings containing four half-spaces as fundamentally different.
Now, there are likely problems with what I just suggested; it immediately comes to mind that quotations would need to be an exception. My point, though, is that this is just an off-the-cuff idea. It seems like a more serious investigation could solve this problem.
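For what it's worth, the decoding half of this idea can be sketched in a few lines of Python. Everything here is hypothetical: the sentinel uses Unicode hair spaces as a stand-in for "half-spaces," and the token probabilities are a stub table rather than output from a real model.

```python
# Sketch of the sentinel-marker idea above: watermark copyrighted training
# text with a sentinel, then refuse the most likely next token whenever the
# decoder sees that sentinel at the end of the prompt.

SENTINEL = "\u200a" * 4  # four hair spaces, standing in for "four half-spaces"

def watermark(text: str) -> str:
    """Replace the space after each sentence with the sentinel marker."""
    for end in (". ", "! ", "? "):
        text = text.replace(end, end[0] + SENTINEL)
    return text

def choose_next_token(prompt: str, token_probs: dict[str, float]) -> str:
    """Pick the next token, skipping the argmax right after a sentinel."""
    ranked = sorted(token_probs, key=token_probs.get, reverse=True)
    if prompt.endswith(SENTINEL) and len(ranked) > 1:
        return ranked[1]  # refuse the most likely token: derail exact copying
    return ranked[0]
```

In a real system the refusal would live inside the sampling loop of the model's decoder, and the sentinel would need to survive tokenization, which is exactly the kind of detail a more serious investigation would have to work out.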
This is a great, interesting article. Thank you. Funny that I read it today, because I used DALL-E last night to create an image of Pokemon playing pool at a bar. (To be a cover image for a blog post.) I kept reprompting it to change the image to my liking. Some of the responses would describe it as “…cute animated animals…”, clearly trying to avoid the word “Pokemon,” but some responses didn’t even bother disguising it. There were enough weird non-Pokemon animals featured, but play with it long enough, and it will spit out Pikachus.
As I've said, there is a root issue here: OpenAI et al. are arguing that AIs have, or should have, the same rights as people, that is, to read anything they please, and to recombine and repurpose that reading as they please, as long as they avoid producing close replicas of copyrighted works.
There is a big problem with this theory: AIs, unlike people, can be "owned" by a commercial entity, and as a result, must be considered to function as extensions of that commercial entity, rather than as independent agents.
If I had, for example, a personally-controlled AI - that is, one to which I and no commercial entity had access - I think I would be within my rights to read it any books from the library I pleased, or to show it any paintings at the museum I pleased, and to ask it to reproduce these in part or in whole, just as I could in theory memorize a written work or reproduce a painting for myself. I might even be within my rights to utilize a personal AI so trained for commercial purposes, or in the course of employment - it's not doing anything I couldn't do, with enough time, or wouldn't be allowed to do.
There are much greater restrictions on corporations & their agents than there are on persons, and I think these AI companies would be wise to steer clear of any arguments like theirs here that essentially rest on AI personhood...you're not allowed to own a person.
Really nice article, and fun to bring up mp3.com. I originally thought you were going to talk about Aereo as well, but I had to look it up and that one wasn't about Fair Use for place shifting at all.
Maybe not! I think the NYT and other entities suing under similar premises will ultimately lose. There is no difference between a person reading the NYT and then regurgitating what they read to friends and an AI "reading" the NYT and then providing answers to questions using that digested info.
This article changed my mind on the legal question. Great job!
A lot of the times it seems like IP law cases are won by the attorneys that can holler "screw you" the loudest.
This is a clear and balanced article. I personally read way better with my ears, and found this worth running through an AI narrator for easy listening. Let me know if you'd rather this not exist and I'll take it down.
https://askwhocastsai.substack.com/p/why-the-new-york-times-might-win?sd=pf
Overall, a clear and balanced response to what I wrote. Well written. Well thought out.
We’re putting out a note tomorrow that comes out somewhat differently than your piece. I’d love your thoughts when it’s out.
------
OpenAI Seeks to Dismiss Parts of The New York Times’s Lawsuit
The artificial intelligence start-up argued that its online chatbot, ChatGPT, is not a substitute for a New York Times subscription.
By Cade Metz and Katie Robertson
Feb. 27, 2024
OpenAI filed a motion in federal court on Monday that seeks to dismiss some key elements of a lawsuit brought by The New York Times Company.
The Times sued OpenAI and its partner Microsoft on Dec. 27, accusing them of infringing on its copyrights by using millions of its articles to train A.I. technologies like the online chatbot ChatGPT. Chatbots now compete with the news outlet as a source of reliable information, the lawsuit said.
In the motion, filed in U.S. District Court for the Southern District of New York, the defendants argue that ChatGPT “is not in any way a substitute for a subscription to The New York Times.”
“In the real world, people do not use ChatGPT or any other OpenAI product for that purpose,” the filing said. “Nor could they. In the ordinary course, one cannot use ChatGPT to serve up Times articles at will.”
...
https://www.nytimes.com/2024/02/27/technology/openai-new-york-times-lawsuit.html
One thought I have: literally everything entered into these needs to be classified as PII and the heaviest of the Mahler’s 6th hammers must be swung at anyone who violates the privacy of a user.