> I recently participated in a NYT story about fair use and generative AI, and why I'm skeptical "fair use" would be a plausible defense for a lot of generative AI products. I also wrote a blog post (https://suchir.net/fair_use.html) about the nitty-gritty details of fair use and why I believe this.
> To give some context: I was at OpenAI for nearly 4 years and worked on ChatGPT for the last 1.5 of them. I initially didn't know much about copyright, fair use, etc. but became curious after seeing all the lawsuits filed against GenAI companies. When I tried to understand the issue better, I eventually came to the conclusion that fair use seems like a pretty implausible defense for a lot of generative AI products, for the basic reason that they can create substitutes that compete with the data they're trained on. I've written up the more detailed reasons for why I believe this in my post. Obviously, I'm not a lawyer, but I still feel like it's important for even non-lawyers to understand the law -- both the letter of it, and also why it's actually there in the first place.
> That being said, I don't want this to read as a critique of ChatGPT or OpenAI per se, because fair use and generative AI is a much broader issue than any one product or company. I highly encourage ML researchers to learn more about copyright -- it's a really important topic, and precedent that's often cited like Google Books isn't actually as supportive as it might seem.
> Feel free to get in touch if you'd like to chat about fair use, ML, or copyright -- I think it's a very interesting intersection. My email's on my personal website.
I'm an applied AI developer and CTO at a law firm, and we discuss the fair use argument quite a bit. It's grey enough that whoever has more financial resources to keep litigating will win. Such is the law and legal industry in the USA.
What bothers me about the argument against fair use (whereby AI ostensibly "replicates" the content competitively against the original) is that it assumes a model trained on journalism produces journalism, or is designed to produce it. The argument against that stance would be easy to make.
The model isn't trained only on journalism; you can't even isolate its training like that. It's trained on human writing in general, across specialties, and it's designed to compete with humans at what humans do with text, of which journalism is merely a tiny special case.
I think the only principled positions here are to either ignore IP rights for LLM training or give up entirely, because a model designed to be as general as a human will need to be trained like a human, i.e. immersed in the same reality we are, the same culture, most of which is shackled by IP claims. And then, obviously, by definition, as it gets better it gets more competitive with humans at everything humans do.
You can produce a complaint that "copyrighted X was used in training a model that can now compete with humans at producing X" for an arbitrary value of X. You can even produce a complaint that "copyrighted X was used in training a model that now outcompetes us at producing Y", for arbitrary X and Y that aren't even related, and it will still be true. Such is the nature of a general-purpose ML model.
This seems to be putting the cart before the horse.
IP rights, and even IP itself as a concept, aren't fundamental to existence, nor the default state of nature. They are contingent concepts, contingent on many factors.
E.g. it has to be actively, continuously maintained as time advances. There could be disagreements on how often: per annum, per case, per WIPO meeting, etc.
But if no such activity occurs over a very long time, say a century, then any claims to any IP will likely, by default, be extinguished.
So nobody needs to do anything for it all to become irrelevant. That will automatically occur given enough time…
the analogy in the anti-fair-use argument is that if I am the WSJ, and you are a reader and investor who reads my newspaper, and then you go on to make a billion dollars in profitable trades, somehow I as the publisher am entitled to some equity or compensation for your use of my journalism.
That argument is as absurd as one where you write a program that does the same thing. Model training is not only fair use, but publishers should be grateful someone has done something of value for humanity with their collected drivelings.
This is the checkmate. The moment anything is published, it is fair game; it becomes part of the human consciousness and is available for incorporation into anything in which it sits as a component. Otherwise, what is the fucking point of publishing, mere revenue? Are we not all collectively competing and contributing? Furthermore, isn't anything copied from something published arguably satire? Protected-speech satire?
Whether or not training is decided as fair use, it does seem like it could affect artists and authors.
Many artists don't like how image generators, trained on their original work, allow others to replicate their (formerly) distinctive style, almost instantly, for pennies.
Many authors don't like how language models let anyone effortlessly create paraphrased versions of the author's books. Plagiarism as a service.
Human artists and writers can (and do) do the same thing, but the smaller scale, slower speed, and higher cost reduces the economic effects.
I think it makes more sense in the context of entertainment. But even in journalism, given the source data, there's no reason an LLM couldn't put together the actual public-facing article, video, etc.
> they can create substitutes that compete with the data they're trained on.
If I'm an artist and copy the style of another artist, I'm also competing with that artist, without violating copyright. I wouldn't see this argument holding up unless it can output close copies of particular works.
Although the model weights themselves are also outputs of the training, and interestingly the companies that train models tend to claim model weights are copyrighted.
If a set of OpenAI model weights ever leak, it would be interesting to see if OpenAI tries to claim they are subject to copyright. Surely it would be a double standard if the outcome is distributing model weights is a copyright violation, but the outputs of model inference are not subject to copyright. If they can only have one of the two, the latter point might be more important to OpenAI than protecting leaked model weights.
Indeed, and to me it's one of the reasons it's hard to argue that generative AI violates copyright.
At least in the US, a derivative work is a creative (i.e. copyrightable) work in its own right. Neither AI models nor their output meet that bar, so it's not clear what the infringing derivative work could be.
Piracy generates works that are neither derivative nor wholly copies (e.g. pre-cracked software). They are not considered creative works in the current framework.
The distinction between a copy and a derivative work isn't the issue. A game is expressive content, regardless of whether it's cracked, modified, public domain, or whatever. If you distribute a pirated game, the thing you're distributing contains expressive content, so if somebody else holds copyright to that content then the use is infringing.
My point is that with LLM outputs that's not true - according to the copyright office they are not themselves expressive content, so it's not obvious how they could infringe on (i.e. contain the expressive content of) other works.
I think you're missing something really obvious here. Piracy is not expressive content. You call it a game, and therefore it must be - but it's not. It's simply an illegal good. It doesn't have to serve any purpose. It cannot be bound by copyright, due to the illegal nature. The Morris Worm wasn't copyrightable content.
Something is not required to be expressive content to be bound under law. That's not a requirement.
The law goes out of its way to not define what "a work" is. The US copyright system instead says "the material deposited constitutes copyrightable subject matter". A copyrightable thing is defined by being copyrightable. There's a logical loop there, allowing the law to define itself, as best makes sense. It leans on Common Law, not some definition that is written down.
"an AI-created work is likely either (1) a public domain work immediately upon creation and without a copyright owner capable of asserting rights or (2) a derivative work of the materials the AI tool was exposed to during training."
AI outputs aren't considered copyrightable, since there's no person responsible. A person has the right to copyright their creations; a machine does not. If the most substantial efforts involved are human, such as directly wielding a tool, then the person may hold copyright in the result. But an automated process will not. As AI stands, the most substantial direction is not supplied by the person.
> It's simply an illegal good. It doesn't have to serve any purpose. It cannot be bound by copyright, due to the illegal nature. The Morris Worm wasn't copyrightable content.
Do you have a source that illegal works can’t be / aren’t copyrighted?
> As long as a work is original and fixed in a tangible medium of expression, it is entitled to copyright protection and eligible for registration, regardless of its content. Thus, child pornography, snuff films or any other original works of authorship that involve criminal activities are copyrightable.
It isn't that an illegal good can't be copyrighted, exactly. It's that if it is illegal, to own the copyright, you have to assert your ownership. In most cases, the consequences of which may involve the state seizing said property from you - to prevent you profiting from the crimes involved.
Nothing about this is correct, at least in the US. Copyright infringement is ordinarily a civil matter - the IP owner can sue over it, and the state doesn't get involved (unless something else is going on beyond simple infringement).
> Piracy is not expressive content. You call it a game, and therefore it must be - but it's not. It's simply an illegal good. It doesn't have to serve any purpose. It cannot be bound by copyright, due to the illegal nature.
To be honest, reading this I have no idea what you think my post said, so I can only ask you to reread it carefully. Obviously nobody would claim "piracy is expressive content" (what would that even mean?). I said a game is expressive content, and that that's why distributing a pirated game infringes copyright.
Non-derivative doesn't mean the same as non-infringing though.
For example, suppose I photograph a copyrighted painting and then start selling copies of the slightly-cropped photo. The output wouldn't have enough originality to qualify as a derivative work (let alone an original work), but it would still be infringement against the painter.
If you added something to the painting then you're selling a derivative work, and if you didn't then you're selling a copy of the work itself - but either way an expressive work is being used, which is what copyright law regulates. IANAL, but with LLM models and outputs that seems not to be the case.
> training on copyrighted data without a similar licensing agreement is also a type of market harm, because it deprives the copyright holder of a source of revenue
I would respond to this as follows:
1. Authors don't actually live on royalties; it's all about ad revenue, which leads to enshittification. If artists, copywriters, and musicians had to live on royalties, they would starve.
2. Copyright is increasingly concentrated in the hands of a few companies and doesn't really benefit the authors or the readers.
3. The real competition for new creative works isn't AI but the old creative works that have been accumulating on the web for 25 years.
I don't think restrictive copyright is what we need. Instead, we've seen people migrate from passive consumption to interactivity: we now prefer games, social networks, and search engines to TV, press, and radio. You can't turn this trend back; it was created by the internet. We now have Wikipedia, GitHub, Linux, open source, the public domain, open scientific publications, and non-restrictive environments for sharing and commenting.
If we were to take the idea of protecting copyright to its extreme, we would need to protect abstract ideas, not just expression, because generative AI can easily route around the latter. But if we protected abstractions from reuse, it would be a disaster for creativity. I just think copyright is a dead man walking at this point.
When does generative AI qualify for fair use? by Suchir Balaji