How are you using the word “copy”? It doesn’t seem to match the standard meaning. For instance, most people would not consider a brief summary of a movie’s plot to be a “copy” of that movie, or protected under copyright.
If you have an image, train a neural network on that image, and then use the neural network to reconstruct that image in detail, then the NN by definition contains enough information to reconstruct that image - hence, a copy.
With NNs trained on thousands or millions of data entries, this concept becomes fuzzy in the way you described - a short summary likely wouldn't be considered a copy, just as a generated 64x64 thumbnail wouldn't be treated the same way as a 4096x4096 hi-res image.
The thing is, the “good” models can’t reconstruct the image in detail. Reconstructing an input exactly is considered a sign of “overfitting”. Even if you put in the exact query that was associated with that image, you’ll get the weighted-average (feature-wise) image associated with the query. This applies to machine learning models in general.
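To make the "weighted average" point concrete, here is a toy sketch (my own illustration, not a real generative model): under squared-error loss, the optimal output for a caption shared by several training images is their pixelwise mean, so a well-fit model blends its inputs rather than reproducing any one of them.

```python
def mse_optimal_output(images):
    """Pixelwise mean of the training images sharing one caption.

    For squared-error loss, the mean is the unique minimiser, which is
    why a non-overfit model tends to blend its training inputs rather
    than reproduce one of them exactly.
    """
    n = len(images)
    width = len(images[0])
    return [sum(img[i] for img in images) / n for i in range(width)]

# Two hypothetical 4-pixel grayscale "images" tagged with the same caption.
img_a = [0, 0, 255, 255]
img_b = [255, 255, 0, 0]

blended = mse_optimal_output([img_a, img_b])
print(blended)  # [127.5, 127.5, 127.5, 127.5] - neither original
```

Real diffusion models are of course far more complex, but the same pull toward an average (rather than an exact training image) is what the overfitting point is about.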
Sure, but I could write a program to spew out an unbounded number of images containing random pixels. It could create an image that is identical to a copyrighted image, but if I just keep that image on my hard drive, have I violated copyright? I don't think I would be, but if I started distributing them, yes I would.
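The program described above is trivial to write; a minimal sketch (the function name and sizes are just for illustration):

```python
import random

def random_image(width, height, seed=None):
    """Generate one image of uniformly random RGB pixels."""
    rng = random.Random(seed)
    return [[(rng.randrange(256), rng.randrange(256), rng.randrange(256))
             for _ in range(width)]
            for _ in range(height)]

# Every possible image of this size is a possible output, including any
# copyrighted one - but there are 256**(64*64*3) distinct 64x64 RGB
# images, so the chance of hitting a specific one is astronomically
# small. Mere possibility of reproduction says little on its own.
img = random_image(64, 64)
print(len(img), len(img[0]))  # 64 64
```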
> If you have an image, then train a neural network on that image, then use the neural network to reconstruct that image in detail
I haven't seen that happening since the discussion started. Most of the complaints I've seen are aimed at things like "it stole my style", not "it reproduced my art".
It's more, 'this product is profiting from my labor without my consent (i.e. without paying me).'
In music you aren't allowed to use the same notes, even if you play them on a trumpet with a swing beat while the source was played very staccato on a piano.
While we don't have the same vocabulary for art, it's not unreasonable to expect similar protections.
It takes work to create/identify/classify information, both in the economic and physics sense. That work should be allowed the same protections we do other forms of work.
Your example is one where nearly no work was done, thus it doesn't deserve much value. "Let a = the set of all songs" doesn't help me find new songs I like. A songwriter does that work. Another artist who takes, uses, and resells that work (without consent) is stealing that work.
To me it's funny that nearly all the problems with the current wave of AI generation would be solved if the model builders simply licensed the content they train on. "But that would cost too much." OK, just use public domain work. "But that wouldn't be as good." Oh, so you're saying the work has value, but you're unwilling to pay for it, and instead your scheme is to just take it. That seems like a good definition of stealing - taking something of value without paying for it.
> Your example is one where nearly no work was done, thus it doesn't deserve much value
You are aware that there is very expensive art out there where the artist did not do much work - like painting a canvas in one colour or throwing an item in the corner of a museum.
According to you, that would not deserve much value, but it does have a lot of value in reality.
In fact "value" is what somebody else gives to the piece of art.
A prompted AI artwork made by me may have more value to me than all the art in the Louvre.
The discussion here keeps revolving around copies, when it's not copies that these algorithms generate.
> Another artist that takes and uses and resells that work (without consent), is stealing that work.
A system in which an artist accidentally uses a melody from another song (because it's a finite set) and is sued for all their income is a horrible system. The winners aren't the people producing value; they're the people who got there first and are now profiting off other people's work.
If I grew up under a rock, somehow became a self-taught musician, and ended up authoring a song that had recognizable components of Happy Birthday, then the author of Happy Birthday, having established that melody so successfully in the public zeitgeist, should reasonably still benefit.
This is so common that the recording industry itself has established rules for sampling, licensing, covers, and so on. Are there some folks out there abusing the system? For sure. But overall its goal is to maximize the value produced by the recording industry, which very much includes the people who 'got there first' and built foundations for future artists. To me, this all seems basically reasonable.
Copyright is supposed to promote the creation of new works. You just described a system where a song written well over 100 years ago is preferred over a new artist creating a new work.
Honestly, can people stop speaking in absolutes regarding these systems? We (researchers and non-researchers alike) are gradually trying to comprehend exactly how much they generalise and memorise, but this is darn hard work and it is not our fault that several major tech giants decided to deploy and profit from these models long before the scientific and legal landscape was clear. Somepalli et al. (2022) [1] for example is a fairly strong argument against your statement above.
The fact is that these systems are complex, new, and interesting. However, it is not the fault of small-time programmers and artists that modern copyright law is a major, overreaching mess that is now finally greatly affecting what the big corporations want to do. They are getting sued? Cry me a river… Perhaps they will finally stop backing the American-led copyright lobby then?
> is a fairly strong argument against your statement above.
From a quick skim of this paper, they apparently used toy models with a few hundred to a few thousand images in the training set. For the ones with as few as a few thousand training images, they rarely or never saw exact duplicates.
For instance, in their figure 4, they show exact duplicates for the training set with only 300 images (well, duh), and didn't find any exact duplicates for the training set with only 3,000.
I'm not sure I'd call this a "strong argument" when applied to models with millions or billions of images. Quite the contrary. LAION-5B (used to train Stable Diffusion) contains roughly 5 billion image/caption pairs.
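For context on what "finding duplicates" means here: the paper matches generations against training data using learned feature descriptors, which I won't reproduce; the following is a much cruder stand-in (a simple average hash, my own illustration) that only shows the general idea of flagging near-duplicate images.

```python
def average_hash(pixels):
    """Crude perceptual hash: threshold each pixel of a small grayscale
    image against the image's mean brightness. (Real retrieval work uses
    learned feature descriptors; this stand-in only illustrates the idea
    of matching generations against training data.)
    """
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return tuple(p > mean for p in flat)

def hamming(h1, h2):
    """Number of differing bits between two hashes."""
    return sum(a != b for a, b in zip(h1, h2))

# Two hypothetical 2x2 grayscale thumbnails: a "training image" and a
# slightly perturbed "generation". A distance of 0 flags a near-duplicate.
train = [[10, 200], [220, 30]]
gen   = [[12, 198], [225, 28]]
print(hamming(average_hash(train), average_hash(gen)))  # 0
```

Doing this kind of matching at the scale of billions of training images is exactly the hard part, which is why subset experiments like the paper's are only a first stab.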
Firstly, thank you for engaging in a discussion. Secondly, I am not an expert in image processing; rather, my focus is on language. Thus my intuitions will not serve me as well in this domain, although the models do have similarities.
They explore a range of sizes and I do not think it is fair to only highlight the smallest ones. They do explore a 12M subset of LAION in Section 7 for a model that was trained on 2B images. Yes, it is not an ideal experimental setup to use a subset (they admit this) and it is far from LAION-5B, but it is a fair stab at this kind of analysis and is likely to lead to further explorations.
Let us return though to your claim, which is what I objected to: “Pretty much none of these systems ‘reconstruct an image in detail’.” I think it is fair to say that this work certainly makes me doubt whether none of these systems (even the larger ones) exhibit behaviour that may limit their generalisability or cross the boundary of what is legally considered derivative work.
You may very well be right that once we scale to billions of images this behaviour is improved (or maybe even disappears), but to the best of my knowledge we do not know if this is the case and we do not know when, how, and why it occurs if it does occur. I remain a firm believer that these kinds of models are the future as there is little evidence that we have reached their limits, but I will continue to caution anyone that talks in absolutes until there is solid evidence to support those claims.