Github recently announced Copilot, a machine learning system that makes suggestions for you when you're writing code. It's apparently trained on all public code hosted on Github, which means there's a lot of free software in its training set. Github assert that the output of Copilot belongs to the user, although they admit that it may occasionally produce output that is identical to content from the training set.
Unsurprisingly, this has led to a number of questions along the lines of "If Copilot embeds code that is identical to GPLed training data, is my code now GPLed?". This is extremely understandable, but the underlying issue is actually more general than that. Even code under permissive licenses like BSD requires retention of copyright notices and disclaimers, and failing to include them is just as much a copyright violation as incorporating GPLed code into a work and not abiding by the terms of the GPL is.
But free software licenses only have power to the extent that copyright permits them to. If your code isn't a derived work of GPLed material, you have no obligation to follow the terms of the GPL. Github clearly believe that Copilot's output doesn't count as a derived work as far as US copyright law goes, and as a result the licenses on the training data don't apply to the output. Some people have interpreted this as an attack on free software - Copilot may insert code that's either identical or extremely similar to GPLed code, and claim that there are no license obligations created as a result, effectively allowing the laundering of GPLed code into proprietary software.
I'm completely unqualified to hold a strong opinion on whether Github's legal position is justifiable or not, and right now I'm also not interested in thinking about it too much. What I think is more interesting is what the impact of either position has on free software. Do we benefit more from a future where the output of Copilot (or similar projects) is considered a derived work of the training data, or one where it isn't? Having been involved in a bunch of GPL enforcement activities, it's very easy to think of this as something that weakens the GPL and, as a result, weakens free software. That was my initial reaction, but that's shifted over the past few days.
Let's look at the GNU manifesto, specifically this section:
> The fact that the easiest way to copy a program is from one neighbor to another, the fact that a program has both source code and object code which are distinct, and the fact that a program is used rather than read and enjoyed, combine to create a situation in which a person who enforces a copyright is harming society as a whole both materially and spiritually; in which a person should not do so regardless of whether the law enables him to.
The GPL makes use of copyright law to ensure that GPLed work can't be taken from the commons. Anyone who produces a derived work of GPLed code is obliged to provide that work under the same terms. If software weren't copyrightable, the GPL would have no power. But this is the outcome Stallman wanted! The GPL doesn't exist because copyright is good, it exists because software being copyrightable is what enables the concept of proprietary software in the first place.
The powers that the GPL uses to enforce sharing of code are used by the authors of proprietary software to reduce that sharing. They attempt to forbid us from examining their code to determine how it works - they argue that anyone who does so is tainted, unable to contribute similar code to free software projects in case they produce a derived work of the original. Broadly speaking, the further the definition of a derived work reaches, the greater the power of proprietary software authors. If Oracle's argument that APIs are copyrightable had prevailed, it would have been disastrous for free software. If the Apple look and feel suit had established that Microsoft infringed Apple's copyright, we might be living in a future where we had no free software desktop environments.
When we argue for an interpretation of copyright law that enhances the power of the GPL, we're also enhancing the power of giant corporations with a lot of lawyers on hand. So let's look at this another way. If Github's interpretation of copyright law holds, we can train a model on proprietary code and extract concepts without having to worry about being tainted. The proprietary code itself won't enter the commons, but the ideas it embodies will. No more worries about whether you're literally copying the code that implements an algorithm you want to duplicate - simply start typing and let the model remove the risk for you.
There's a reasonable counter argument about equality here. How much GPL-influenced code is going to end up in proprietary projects when compared to the reverse? It's not an easy question to answer, but we should bear in mind that the majority of public repositories on Github aren't under an open source license. Copilot is already claiming to give us access to the concepts embodied in those repositories. Do these provide more value than is given up? I honestly don't know how to measure that. But what I do know is that free software was founded in a belief that software shouldn't be constrained by copyright, and our default stance shouldn't be to argue against the idea that copyright is weaker than we imagined.
(Edit: this post by Julia Reda makes some of the same arguments, but spends some more time focusing on a legal analysis of why having copyright cover the output of Copilot would be a problem)
Assumptions assumptions
Date: 2021-07-13 10:58 am (UTC)
I think the likely problematic part is the straightforward reappropriation of all copyright.
I don't think that the use of data for training is much of a copyright-relevant thing (and indeed, the freedom to do so...). I also don't think that in general trained models should necessarily have a copyrightable relation to the inputs.
The truth is that if you have a reasonably uniquely named class and prompt with
class MyElaborateCreation:
it will happily reproduce the entire code. Certainly chunks large enough that copyright would seem to apply squarely. As such it has all the smell of a circumvention device.
Julia Reda (like Luis Villa before her) appears to make the assumption (based on early examples like the rsqrt trick) that the reproduction will be rare small samples. I have serious doubts that if you get to "here is the original", "here is the infringing other source" to the point where there is clear infringement, "but it was generated by an AI" is an excuse. In fact, I do believe that the most legitimate legal opinions will leave out all the AI references and just judge copyright infringement by what the allegedly infringing work is relative to the original work.
Now, it would be interesting if someone tried to send a DMCA notice to Copilot after finding an infringement of code they own. I am sure this will happen eventually and then we will see.
no subject
Date: 2021-07-13 11:26 am (UTC)
Is that really true? Proprietary source code is usually not available, at least not legally (and I doubt that it would be legal to train a model using illegally obtained code). And if it is available, is using it to train a model allowed? For free software, this seems obvious, because of freedom 1 and maybe 0. But do proprietary licenses allow using code in such a way? Or is this something they can't restrict because of limitations/exceptions to copyright?
no subject
Date: 2021-07-13 12:39 pm (UTC)
All the open source and free and public licenses depend on the mechanism of copyright to demand specific payments from the users: you get to use this code and the payment is acknowledgement, says the BSD license. You get to use this code and the payment is that you must make the derivative work available under the same terms as this license, says another.
If Github's position is truly that Copilot output is *not* a derivative work, then there is no reason that they would not use *all* the code handed to them in the training corpus, public or private.
As far as I can tell, Github's basis for training Copilot at all is in their privacy terms, which give them the right to:
"parse Customer Content into a search index or otherwise analyze it on GitHub's servers; .. These rights apply to both public and Private Repositories. "
If Copilot output is a derivative work, obviously they can't use the Customer Content which is in Private Repositories and therefore Confidential Information (to use their own terms); but also obvious to me is that if Copilot is not a derivative work, the same clause that allows them to analyze public repos allows them to use private repos. The fact that they say that they don't use private repos is an acknowledgment that Copilot output is a derivative work.
no subject
Date: 2021-07-13 02:09 pm (UTC)
This makes me uneasy - a bad photocopier that produces inexact copies is a machine, but its output is clearly a matter of copyright.
Again, I'm not sure that exact excerpts are necessary. Imagine a photocopier that produced a photographic negative of its input. No part of the output image would be the same as the input image. There is no part of the output you can point to and say "this is a copy of the input" - but it's still clearly a derived work, and subject to copyright.
In fact, I'm reminded of the classic What Colour are your bits?:
> It matters where the bits came from. Even if they're not the same bits as the originals.
no subject
Date: 2021-07-17 10:53 am (UTC)
People use other kinds of automatic code generators all the time... so that means the output of those code generators can't be copyrighted based on her statement. In fact, this applies to lexing, parsing and a lot of other activities as well.
Honestly, it's an easily countered and crappy argument to make.
no subject
Date: 2021-07-17 10:55 am (UTC)
This means anyone can create digital scans of books and freely redistribute them or even sell them.
Paranoia!
Date: 2022-11-13 01:36 am (UTC)
Lo and behold!
Asymmetry
Date: 2021-07-13 02:43 pm (UTC)
It seems to me that the likely outcome of "copyright doesn't cover ML model output" is that Github (or whoever) will train models on all open source code (of any license) and generated code from that can be used in proprietary software with no source made available. Meanwhile, Github is highly unlikely to train its public model on Github or Microsoft proprietary source code, and no one else can legally access that code to train a model on it.
So the overall ratchet effect is that GPL/MIT/whatever code can now be used in proprietary code bases, without regard to the original license, but proprietary code will largely never be used to train these models, and thus inaccessible to open source developers, or developers at other proprietary software companies.
You could imagine a Github Enterprise product where you can pay for a version of Copilot that is trained on both the open source corpus as well as your company's private proprietary code, accessible only to developers in your company's employ.
no subject
Date: 2021-07-14 08:01 am (UTC)
I didn't follow this bit:
> The GPL makes use of copyright law to ensure that GPLed work can't be taken from the commons. Anyone who produces a derived work of GPLed code is obliged to provide that work under the same terms. If software weren't copyrightable, the GPL would have no power. But this is the outcome Stallman wanted! The GPL doesn't exist because copyright is good, it exists because software being copyrightable is what enables the concept of proprietary software in the first place.
With copyright law and the GPL, I can GPL my software. If someone else wants to derive from it they have to transmit their (presumably improved) source with binaries, under GPL terms; they cannot (legally) take it from the commons. Anyone can study and derive from the modified work.
Without copyright law, and thus without the GPL, they can legally keep the derived source code secret. They may not be able to impose any legal control over distribution and use of binaries but they have taken the source code from the commons (at least until their next data breach). Nobody[1] else can study or derive from the modified work.
The latter doesn't seem like the target outcome?
[1] OK, people who like spending time with disassemblers a bit can study it, but the situation is a lot worse than having the source.
(No strong opinions about Copilot other than I'm not touching it until the legal situation is a lot clearer.)
The twin sides of copyright
Date: 2021-10-17 05:01 am (UTC)
With copyright law, hypothetical software creator OrangeKey can create their UnBalloon software and sell it. Then as part of that, they can tell you "we will allow you to use this software, but you can't give copies of it to anyone else" (implying "or we will take you to court, and you will lose").
Without copyright law, once you buy a copy, you can just hand out copies to people you know -- or if you're the helpful type, to anyone who asks. This is not as useful to programmers as being able to look at copyleft code, true, but it's about equally useful to most average users.
Of course, things have changed a lot since the idea of copyleft was first set upon, and I'm not sure this side is as major a factor anymore. Even the proprietary stuff is pretty much available like ~~water~~ air nowadays -- it's just festooned with ads and spyware that it won't run without. Either that, or it's "SaaS" and all the useful bits are actually running on someone else's computer (and much of that still has the ads and spyware). The internet as a distribution method makes the "source" part of copyleft and GPL seem a lot more important these days.
no subject
Date: 2021-07-14 10:57 am (UTC)
What about programmers?
Date: 2021-07-15 07:25 am (UTC)
no subject
Date: 2021-07-17 10:45 am (UTC)
If you looked at confidential source code, say the files that implement the Windows scheduler, and then wrote the exact same thing in another file, manually typing it out letter by letter, is the latter your own original work? What if you just changed the names of variables and classes? Is that also your original work? What do you think copyright law and the courts would say if you were sued?
Microsoft would 100% sue you in that case. So no, they certainly don't get to steal significant chunks of GPL and similarly licensed code and get to copy it as they wish and spin the result as original work. There is no defense for that.
You assert that this is beneficial for free software, but that is only true in a world where copyright has no legal force. Until that's changed, this is still a violation of the GPL.
My views on the topic
Date: 2022-03-01 10:26 pm (UTC)
Not in the slightest.
Usually the purpose of copyleft is to lead to the creation of more free software. Copyleft says "You want to create a derivative of this program? Go ahead, but it must be free software under the same license!". This can be considered successful when software which would otherwise have been released as proprietary is released as free instead.
Consider, however, the activity of training a neural network: because it requires a huge amount of data, and because any additional selection criterion also adds selection bias, the dataset will include source code licensed under multiple incompatible licenses (as well as unlicensed source code).
If the model is a derived work of the software, this doesn't help the free software community in any way and copyleft cannot be successful because training the model becomes impossible: nothing is released as free software as a result of source code under the GPL.
In essence, such a view of copyright law would severely harm a whole sector (that of machine learning) with no benefit for the free software community. While it would also hurt proprietary software, hurting a whole field of application is generally not good and shouldn't be a strategy for the free software community.
If a situation occurs where an activity being illegal because of copyright would hurt free and proprietary software alike, without really helping anyone, the most reasonable opinion for a community which stands for freedom would be to support the idea that such activity is not (or shouldn't be) illegal.
In addition, while there is an asymmetry between free and proprietary software (the source code of most proprietary software is unavailable and will not end up in training datasets), this is not so for other kinds of work, to which copyright also applies.
When thinking of what readings of copyright law to support, then, it is imperative to consider those kinds of work as well. If including works in training datasets is allowed, this is good for the field of machine learning. And while it doesn't hurt the authors of the works in the training dataset, allowing such activity is necessarily closer to the intention of free licenses than proprietary ones.
In general, then, a reading of copyright law which allows for using copyrighted works in training datasets would benefit the authors of free works, including software, more than those of proprietary works, and it's a kind of freedom the "free world" should support.
Yesterday I commented on this topic on the Libreplanet mailing list. I wish I had cited this blogpost as well: https://lists.libreplanet.org/archive/html/libreplanet-discuss/2022-02/msg00071.html