This weekly roundup thread is intended for all culture war posts. 'Culture war' is vaguely defined, but it basically means controversial issues that fall along set tribal lines. Arguments over culture war issues generate a lot of heat and little light, and few deeply entrenched people ever change their minds. This thread is for voicing opinions and analyzing the state of the discussion while trying to optimize for light over heat.
Optimistically, we think that engaging with people you disagree with is worth your time, and so is being nice! Pessimistically, there are many dynamics that can lead discussions on Culture War topics to become unproductive. There's a human tendency to divide along tribal lines, praising your ingroup and vilifying your outgroup - and if you think you find it easy to criticize your ingroup, then it may be that your outgroup is not who you think it is. Extremists with opposing positions can feed off each other, highlighting each other's worst points to justify their own angry rhetoric, which becomes in turn a new example of bad behavior for the other side to highlight.
We would like to avoid these negative dynamics. Accordingly, we ask that you do not use this thread for waging the Culture War. Examples of waging the Culture War:
- Shaming.
- Attempting to 'build consensus' or enforce ideological conformity.
- Making sweeping generalizations to vilify a group you dislike.
- Recruiting for a cause.
- Posting links that could be summarized as 'Boo outgroup!' Basically, if your content is 'Can you believe what Those People did this week?' then you should either refrain from posting, or do some very patient work to contextualize and/or steel-man the relevant viewpoint.
In general, you should argue to understand, not to win. This thread is not territory to be claimed by one group or another; indeed, the aim is to have many different viewpoints represented here. Thus, we also ask that you follow some guidelines:
- Speak plainly. Avoid sarcasm and mockery. When disagreeing with someone, state your objections explicitly.
- Be as precise and charitable as you can. Don't paraphrase unflatteringly.
- Don't imply that someone said something they did not say, even if you think it follows from what they said.
- Write like everyone is reading and you want them to be included in the discussion.
On an ad hoc basis, the mods will try to compile a list of the best posts/comments from the previous week, posted in Quality Contribution threads and archived at /r/TheThread. You may nominate a comment for this list by clicking on 'report' at the bottom of the post and typing 'Actually a quality contribution' as the report reason.
I'd argue that a neural net is a derivative work of its training data, so its mere creation is a copyright violation.
But you could make a similar argument that a human brain is a derivative work of its training data. Obviously there are huge differences, but are those differences relevant to the core argument? A neural net takes a bunch of stuff it's seen before and then combines ideas and concepts from them in a new form. A human takes a bunch of stuff they've seen before and then combines ideas and concepts from them in a new form. Copyright laws typically allow for borrowing concepts and ideas from other things as long as the new work is transformative and different enough that it isn't just a blatant ripoff. Otherwise you couldn't even have such a thing as a "genre", which all share a bunch of features that they copy from each other.
So it seems to me that, if a neural net creates content which is substantially different from any of its inputs, then it isn't copying them in a legal or moral sense, beyond the degree to which a normal human creator who had seen the same training data and been inspired by it would be copying them.
The dystopian take is obviously that the copyright lawyers will come for the brain next: experiencing copyrighted media without paying for it will be criminalized.
That's an entirely different question. Obviously the LLM is not itself a human, but neither is a typewriter or computer which a human uses as a tool to write something. So probably the copyright holder would be the person who prompts the LLM and then takes its output and tries to publish it, especially if they are responsible for editing its text and don't just copy-paste it unchanged. You could make an argument that the LLM creator is the copyright holder, or that the LLM is responsible for its own output, which is then uncopyrightable since it wasn't produced by a human.
But regardless of how you address the above question, it doesn't change my main point: the AI does not violate the copyrights of the humans whose work it takes as input any differently than a human doing the same things would. Copyright law is complicated, but there's a long history, a lot of precedent, and individual issues tend to get worked out. For this purpose, the LLM, or a human using an LLM as an assistant, should be subject to the same constraints that human creators already are. They're not "stealing" any more or less than humans already do by consuming each other's work. You don't need special laws or rules or restrictions that don't already exist.
You can't reason by analogy with what humans do, because LLMs are not human. They are devices, which contain data stored on media. If that data encodes copyrighted works, they are quite possibly copyright violations. If I memorize the "I have a dream" speech, the King estate can do nothing to me. They can bust me for reciting it in public, but I can recite it in private all I want (though I could get in trouble for writing it down). If I can ask an LLM for the "I have a dream" speech and it produces it, I have proven that the LLM contains a copy of the "I have a dream" speech and is therefore a copyright violation. And that's just the reproduction right; the derivative work right is even wider.
Except that LLMs don't explicitly memorize any text; they generate it. It's the difference between storing an explicit list of all the numbers 1 to 100, {1, 2, 3, ..., 100}, and storing a set of instructions, {f(n) = n : n in [1, 100]}, that can be used to generate the list. The model has a complicated set of relationships between words, refined to the point that if it sees the prompt 'Recite the "I have a dream" speech verbatim', it has a very good probability of saying each of the words correctly. At least I think the better versions do; many of them would not actually get it word for word, because none of them have it actually memorized. They're generating it anew.
Now granted, you can strongly argue, and I would tend to agree, that a word-for-word recitation by an LLM of a copyrighted work is a copyright violation, but this is analogous to being busted for reciting it in public. The LLM learning from copyrighted works is not a violation, because during training it doesn't copy them; it learns from them and changes its own internal structure in ways that improve its generating function, making it more capable of producing works similar to them, without actually copying or remembering them directly. And it doesn't create an actual verbatim copy unless specifically asked to (and even then it is likely to fail, because it doesn't have a copy stored and has to generate one from its function).
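The list-vs-generator distinction above can be sketched in a few lines. This is an illustration of the analogy only, not of how an LLM actually works: a real model's generating function is a trained neural network, not a one-line rule.

```python
# Explicit storage: all 100 numbers held in memory.
stored = list(range(1, 101))

def f(n):
    """Generating rule f(n) = n for n in [1, 100]; no list is stored."""
    return n

# The rule reproduces the same sequence on demand.
generated = [f(n) for n in range(1, 101)]
assert stored == generated
```

Both objects can emit identical output, but only `stored` holds the data itself; `f` holds a rule. The question in the thread is which side of that line an LLM falls on once its rule can reproduce long copyrighted passages on demand.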
Imagine I create some wacky compression format that will encode multiple copyrighted images into a single file, returning one of them when you hit it with a decryption key corresponding to the name of the image: the file "contains" nothing but line noise, but if you run it through the decompressor with the key "Mickey Mouse" it will return a copyright violation.
Is passing this file around on Napster (or whatever the kids are doing these days) a copyright violation?
I'm pretty sure it is, if that's the intended use case of the file and people other than you know about the decryption method. On the other hand, literally any data file of a certain length (call it A) can be turned into literally any other data file of the same length (B) if you hit it with exactly the right "decryption" (B - A), just by adding the bits together. So if you take this idea too far, every file is secretly an encrypted Mickey Mouse given the right code.
There's something nontrivial in here about information theory. If the copyrighted image has 500 kB of data, your "encrypted file" is 500 kB, and the decryption key "Mickey Mouse" is 12 bytes, then clearly the file must contain the copyright violation. If instead you make an "encrypted file" of 12 kB plus some wacky compression algorithm that requires 500 kB to encode and is specifically designed to transform the string "Mickey Mouse" into a copyrighted image, then yeah, that algorithm is a copyright violation.
On the other hand, if you use a random number generator to generate a random 500 kb number A, and then compute C = (B - A) where B is your copyrighted image, then in isolation both A and C are random numbers. If you just distribute A and nobody has any way of knowing or guessing C, then no copyright violation has occurred. If you just distribute C and nobody has any way of knowing or guessing A, then no copyright violation has occurred. But if you distribute them together, or if you distribute one and someone else distributes the other, or if one of them is a commonly known or guessable number, then you're probably violating copyright and trying to get away on a technicality.
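The split construction above can be sketched concretely. This uses XOR in place of subtraction (the same idea: either share alone is uniformly random), and the byte string standing in for the copyrighted image is a hypothetical placeholder, not real data.

```python
import secrets

# B is a stand-in for the "copyrighted" bytes.
B = b"Mickey Mouse pixel data"

# A is a uniformly random pad of the same length; C = B XOR A.
# In isolation, both A and C are indistinguishable from random noise.
A = secrets.token_bytes(len(B))
C = bytes(b ^ a for b, a in zip(B, A))

# Anyone holding BOTH shares can reconstruct the original exactly.
recovered = bytes(c ^ a for c, a in zip(C, A))
assert recovered == B
```

This is just a one-time pad: distributing A alone or C alone reveals nothing about B, which is why the question of violation turns on whether the two shares are distributed together or one of them is guessable.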
But it's not enough for it to simply be possible to "decrypt" something into another thing. A string of pure 0s can be "decrypted" into any image or text. A word processor will generate any copyrighted text if the user presses the right keys in the right combination. I think there has to be some threshold of intent, ease, or information-theoretic content such that the file is doing the majority of the work.
So I'll concede that if you make an LLM that will easily reproduce copyrighted material given simple descriptions and passwords, I can see there being issues, similar to how an author who keeps spitting out blatant ripoffs of copyrighted works with a couple of words changed will get in trouble. But simply having used copyrighted works in the training material is not itself a copyright violation. A robust LLM that has trained on lots of copyrighted material but refuses to replicate it verbatim is not a copyright violation simply for having learned from it (and that learning seems to be the primary objection artists have, not the actual reproduction of their work, which I would agree is bad).
Indeed, I claim that's closer to description than analogy. An LLM is a way of encoding (lossily) a whole lot of textual data in a very opaque form, in a way that you can get much of that data out by giving fairly intuitive prompts.
From 17 USC 101: "'Copies' are material objects, other than phonorecords, in which a work is fixed by any method now known or later developed, and from which the work can be perceived, reproduced, or otherwise communicated, either directly or with the aid of a machine or device."
Thus an LLM which can reproduce a copyrighted work is a copy of it, regardless of whether it is "generating" it or not.
According to your interpretation, a pencil and a blank notebook would be a copy of a copyrighted work: it is entirely possible to use that pencil to write down a copy of the script of a recent Hollywood movie, which means that any blank piece of media is effectively a copy of any copyrighted work which can be expressed upon it.
No, the blank notebook is clearly not a copy of a copyrighted work, until you actually write down the script; the work is not fixed in it. There's no way to get the blank notebook to reproduce the copyrighted work without putting it on the notebook yourself. If you could write "The Fabelmans" at the top of the blank notebook and the rest of the script would appear, THEN you could validly argue the (apparently) blank notebook is a copy of a copyrighted work.
Once again, it is unclear to me how this is very different from a human who reads a bunch of scripts/novels/poems and then produces something similar to what he studied.
There are a lot of different ways you could look at it, but I think I might just say that the principle of "if you use someone else's work to build a machine that replaces their job, then you have a responsibility to compensate that person" seems axiomatic to me. To say that the original writers/artists/etc. are owed nothing, even though the AI literally could not exist without them, is just blatantly unfair.
Is it any different from the early factory laborers building the machines that would replace them? Or maybe more aptly, the carriage companies that supplied the Ford factories. They were paid fairly enough for their production. That was the narrow agreement; it was never agreed that no one else could be inspired by the work, or build an inspiration machine. To be replaced by people who were inspired by your works is the fate of every artist in history, at least those who didn't wallow in obscurity.
They consented and were paid. It's not analogous at all.
They produced the media, which is being consumed and paid for under the current payment model. They are being compensated for it regularly and under all the fair and proper agreements. The AI is trained off of the general inspiration in the air, which is also where the artists pulled their own works from. It's a common resource emitted by everyone in the culture to some degree. The Disney corporation did not invent its stories from whole cloth; it took latent myths and old traditional tales from us, packaged them for us, and the ideas return to us. Now we're going to have a tool that can also tap this common vein, and more equitably? This is democratization. This is equality. This is progress.
Last week I not so popularly defended copyright, and I still believe it's the best compromise available to us. But it doesn't exist because of a fundamental right, it exists because it solves a problem we have with incentivizing the upfront cost of creating media. If these costs can be removed from the equation then the balance shifts.
How do you feel about software license agreements? Plenty of software/code is publicly visible on the internet and can be downloaded for free, but it's accompanied by complex licensing terms that state what you can and can't do with it, and you agree to those terms just by downloading the software. Do you think that license agreements are just nonsense? Once something is out on the internet then no one can tell you what you can and can't do with it?
If you think that once a sequence of bits is out there, it's out there, and anyone can do anything with it, then it would follow that you wouldn't see anything wrong with AI training as well.
In the case of images for training image generation AI, I don't think they were published on the internet with license agreements stipulating limitations on what people could do with the images after downloading them. IANAL, but I think if you publish an image online and don't gate it in some way like behind a password login and license agreement, you get nothing beyond the default basic legal protections for how people use it, i.e. not publishing a copy or derivative work and whatnot.
I think many images on the internet, particularly those posted to art sites like dA, ArtStation, HentaiFoundry, and maybe also Pixiv, do tend to come with a Creative Commons or Berne Convention license disclaimer, but as you suggest, this doesn't really do much against scraping. Artists generally ask and demand that their art not be reused without permission.
Because no artist knew what ML training even was until early-ish 2022.
If you think it's possible for an artist to have any rights at all over whether their work is used for AI training or not, then clearly they should be able to seek redress for images that have already been used without their permission. Trying to get off on a technicality of "aww shucks, maybe if you had published your work back in 2015 along with an explicit notice that said 'you do not have permission to use this image to train any AI models' then you would have a case, but, you didn't, so too bad..." is just silly.
I'm sure there's a Latin legal phrase that translates to "Bruh. Come on." and I would cite that principle as my justification.
But obviously if you don't think that artists have any right even going forward to control the use of their work for training then this point will be irrelevant for you.
I think such a regime could be set up, but it would not be justified in the way that current copyright is.
Because the law doesn't consider your memory or the arrangement or functioning of your neurons to be a tangible medium of expression, but a computer memory (including RAM, flash, spinning rust, paper tape, whatever) is. If you see a work and produce something similar, what you produce might indeed be a copyright violation (and there are a lot of lawsuits about that), but your mind can't itself be a copyright violation. A neural network in a computer can be.
This doesn't seem to follow either. Maybe what the computer produces could be a violation, but the actual information contained within it doesn't resemble the copyrighted material in any way we can determine. At least that's based on my understanding of how the trained AI works.
The fact that the information contained within the LLM doesn't, in some sense, resemble the copyrighted material isn't relevant, nor should it be; the fact that we can get it out demonstrates that it is in there.
Which is why I'm confused as to why this does not apply to human memory, other than that being simply the distinction the courts draw between human hardware and electronic systems.
And then, even accepting your point:
Then presumably putting sufficient controls on the system so that it WILL NOT produce copyrighted works on demand solves the objection.
We also run into the 'library of Babel' issue. If you have a sufficiently large pile of randomized information, then it probably contains 'copies' of various copyrighted works that can be extracted.
So an AI that is trained on and 'contains' the entire corpus of all human-created text might be said to contain copies of various works, but the vast, vast majority of what it contains is completely 'novel,' unrelated information which users can generate at will, too.
There may be no other distinction.
Only if the controls are inseparable from the system. If I can take your weights and use them in an uncensored system to produce the copyrighted works, they were still in there. Just as rigging a DVD player not to play a DVD of "Return of the Jedi" wouldn't mean "Return of the Jedi" wasn't on the DVD.
If the randomized information was generated without reference to the copyrighted works, this doesn't matter. That's not the case with the AIs; they had the copyrighted works as training data.
Yes yes, but so did the humans.
This is what gets real fraught here.
The end result of harshly applying IP restrictions is preventing someone who has a sufficiently accurate memory of a copyrighted work from conveying any portion of the work to anyone else outside of a 'fair use' context.
If the technology allowed it, I have little doubt that they'd implement a system to collect royalties every time some college student plays Wonderwall on his guitar at a party.
But on the assumption that the laws haven't gotten THAT ridiculous yet, I'm inclined to suggest that the AI, as it stands, is basically the equivalent of a human being with an eidetic memory: it can recall any copyrighted work it wishes, on demand, given the correct prompting, but it is also fully capable of "original" thought.
Does IP law REALLY grant the owner the absolute right of ownership over every single instance of the information that composes their work... regardless of the format, medium, or usage it is put to?
Well, we've already had lawsuits asserting that people can be found liable for subconsciously plagiarizing someone else's music: https://en.wikipedia.org/wiki/My_Sweet_Lord#Copyright_infringement_suit
Copyright does not restrict private performance, so if you've memorized a work you can tell it to someone else in private. You can't write it down (including typing it into a computer), you can't make other works based on it, and you can't perform it publicly (which sometimes includes performing it to many people in a serial fashion).
They do have such a system, and they (in this case, ASCAP and BMI, mainly) are famous for busting venues over it.