[personal profile] mjg59
Github recently announced Copilot, a machine learning system that makes suggestions for you when you're writing code. It's apparently trained on all public code hosted on Github, which means there's a lot of free software in its training set. Github assert that the output of Copilot belongs to the user, although they admit that it may occasionally produce output that is identical to content from the training set.

Unsurprisingly, this has led to a number of questions along the lines of "If Copilot embeds code that is identical to GPLed training data, is my code now GPLed?". This is extremely understandable, but the underlying issue is actually more general than that. Even code under permissive licenses like BSD requires retention of copyright notices and disclaimers, and failing to include them is just as much a copyright violation as incorporating GPLed code into a work and not abiding by the terms of the GPL is.

But free software licenses only have power to the extent that copyright permits them to. If your code isn't a derived work of GPLed material, you have no obligation to follow the terms of the GPL. Github clearly believe that Copilot's output doesn't count as a derived work as far as US copyright law goes, and as a result the licenses on the training data don't apply to the output. Some people have interpreted this as an attack on free software - Copilot may insert code that's either identical or extremely similar to GPLed code, and claim that there are no license obligations created as a result, effectively allowing the laundering of GPLed code into proprietary software.

I'm completely unqualified to hold a strong opinion on whether Github's legal position is justifiable or not, and right now I'm also not interested in thinking about it too much. What I think is more interesting is what the impact of either position has on free software. Do we benefit more from a future where the output of Copilot (or similar projects) is considered a derived work of the training data, or one where it isn't? Having been involved in a bunch of GPL enforcement activities, it's very easy to think of this as something that weakens the GPL and, as a result, weakens free software. That was my initial reaction, but that's shifted over the past few days.

Let's look at the GNU manifesto, specifically this section:

The fact that the easiest way to copy a program is from one neighbor to another, the fact that a program has both source code and object code which are distinct, and the fact that a program is used rather than read and enjoyed, combine to create a situation in which a person who enforces a copyright is harming society as a whole both materially and spiritually; in which a person should not do so regardless of whether the law enables him to.

The GPL makes use of copyright law to ensure that GPLed work can't be taken from the commons. Anyone who produces a derived work of GPLed code is obliged to provide that work under the same terms. If software weren't copyrightable, the GPL would have no power. But this is the outcome Stallman wanted! The GPL doesn't exist because copyright is good, it exists because software being copyrightable is what enables the concept of proprietary software in the first place.

The powers that the GPL uses to enforce sharing of code are used by the authors of proprietary software to reduce that sharing. They attempt to forbid us from examining their code to determine how it works - they argue that anyone who does so is tainted, unable to contribute similar code to free software projects in case they produce a derived work of the original. Broadly speaking, the further the definition of a derived work reaches, the greater the power of proprietary software authors. If Oracle's argument that APIs are copyrightable had prevailed, it would have been disastrous for free software. If the Apple look and feel suit had established that Microsoft infringed Apple's copyright, we might be living in a future where we had no free software desktop environments.

When we argue for an interpretation of copyright law that enhances the power of the GPL, we're also enhancing the power of giant corporations with a lot of lawyers on hand. So let's look at this another way. If Github's interpretation of copyright law holds, we can train a model on proprietary code and extract concepts without having to worry about being tainted. The proprietary code itself won't enter the commons, but the ideas it embodies will. No more worries about whether you're literally copying the code that implements an algorithm you want to duplicate - simply start typing and let the model remove the risk for you.

There's a reasonable counter argument about equality here. How much GPL-influenced code is going to end up in proprietary projects when compared to the reverse? It's not an easy question to answer, but we should bear in mind that the majority of public repositories on Github aren't under an open source license. Copilot is already claiming to give us access to the concepts embodied in those repositories. Do these provide more value than is given up? I honestly don't know how to measure that. But what I do know is that free software was founded in a belief that software shouldn't be constrained by copyright, and our default stance shouldn't be to argue against the idea that copyright is weaker than we imagined.

(Edit: this post by Julia Reda makes some of the same arguments, but spends some more time focusing on a legal analysis of why having copyright cover the output of Copilot would be a problem)

Assumptions assumptions

Date: 2021-07-13 10:58 am (UTC)
From: (Anonymous)

I think the likely problematic part is the straightforward reappropriation of all copyright.

I don't think that the use of data for training is much of a copyright-relevant thing (and indeed, the freedom to do so...). I also don't think that in general trained models should necessarily have a copyrightable relation to their inputs.

The truth is that if you have a reasonably uniquely named class and prompt with "class MyElaborateCreation:", it will happily reproduce the entire code. Certainly chunks large enough that copyright would seem to apply squarely. As such it has all the smell of a circumvention device.

Julia Reda (like Luis Villa before her) appears to make the assumption (based on early examples like the rsqrt trick) that the reproduction will be rare small samples. I have serious doubts that if you get to "here is the original", "here is the infringing other source" to the point where there is clear infringement, "but it was generated by an AI" is an excuse. In fact, I believe that the most legitimate legal opinions will leave out all the AI references and just judge copyright infringement by comparing the allegedly infringing work to the original work.

Now, it would be interesting if someone tried to send a DMCA notice to Copilot by finding an infringement in code they own. I am sure this will happen eventually, and then we will see.

Date: 2021-07-13 11:26 am (UTC)
From: (Anonymous)
If Github's interpretation of copyright law holds, we can train a model on proprietary code and extract concepts without having to worry about being tainted.
Is that really true? Proprietary source code is usually not available, at least not legally (and I doubt that it would be legal to train a model using illegally obtained code). And if it is available, is using it to train a model allowed? For free software, this seems obvious, because of freedom 1 and maybe 0. But do proprietary licenses allow using code in such a way? Or is this something they can't restrict because of limitations/exceptions to copyright?

Date: 2021-07-13 12:39 pm (UTC)
From: [personal profile] dsrtao
In the US [Github being owned by Microsoft, we can assume that US laws apply] copyright exists at the first act of fixing the form, which is to say writing it down and saving it. The author can renounce the copyright by dedicating it to the public domain; other than that, everything copyrightable is, by default, copyrighted.

All the open source and free and public licenses depend on the mechanism of copyright to demand specific payments from the users: you get to use this code and the payment is acknowledgement, says the BSD license. You get to use this code and the payment is that you must make the derivative work available under the same terms as this license, says another.

If Github's position is truly that Copilot output is *not* a derivative work, then there is no reason that they would not use *all* the code handed to them in the training corpus, public or private.

As far as I can tell, Github's basis for training Copilot at all is in their privacy terms, which give them the right to:

"parse Customer Content into a search index or otherwise analyze it on GitHub's servers; ... These rights apply to both public and Private Repositories."

If Copilot output is a derivative work, obviously they can't use the Customer Content which is in Private Repositories and therefore Confidential Information (to use their own terms); but also obvious to me is that if Copilot is not a derivative work, the same clause that allows them to analyze public repos allows them to use private repos. The fact that they say that they don't use private repos is an acknowledgment that Copilot output is a derivative work.

Date: 2021-07-13 02:09 pm (UTC)
From: [personal profile] grok_mctanys
From Julia's post:
Copyright law has only ever applied to intellectual creations – where there is no creator, there is no work. This means that machine-generated code like that of GitHub Copilot is not a work under copyright law at all, so it is not a derivative work either. The output of a machine simply does not qualify for copyright protection – it is in the public domain.

This makes me uneasy - a bad photocopier that produces inexact copies is a machine, but its output is clearly a matter of copyright.
This line of reasoning is dangerous in two respects: On the one hand, it suggests that even reproducing the smallest excerpts of protected works constitutes copyright infringement.

Again, I'm not sure that exact excerpts are necessary. Imagine a photocopier that produced a photographic negative of its input. No part of the output image would be the same as the input image. There is no part of the output you can point to and say "this is a copy of the input" - but it's still clearly a derived work, and subject to copyright.
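The negative-photocopier argument can be sketched in a few lines of code (the input string is hypothetical, purely for illustration): invert every byte of an input, so that no byte of the output matches the input, yet the original remains trivially recoverable - exactly the sense in which the output is derived without being a literal copy.

```python
# Hypothetical "negative photocopier": flip every bit of the input.
original = b"Some copyrighted text"

# Produce the "negative": every byte is inverted.
negative = bytes(b ^ 0xFF for b in original)

# No byte of the negative equals the corresponding byte of the original...
assert all(a != b for a, b in zip(original, negative))

# ...but inverting again restores the original exactly.
restored = bytes(b ^ 0xFF for b in negative)
assert restored == original
```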

In fact, I'm reminded of the classic What Colour are your bits?:
The fallacy of Monolith is that it's playing fast and loose with Colour, attempting to use legal rules one moment and math rules another moment as convenient. When you have a copyrighted file at the start, that file clearly has the "covered by copyright" Colour, and you're not cleared for it, Citizen. When it's scrambled by Monolith, the claim is that the resulting file has no Colour - how could it have the copyright Colour? It's just random bits! Then when it's descrambled, it still can't have the copyright Colour because it came from public inputs. The problem is that there are two conflicting sets of rules there. Under the lawyer's rules, Colour is not a mathematical function of the bits that you can determine by examining the bits. It matters where the bits came from. The scrambled file still has the copyright Colour because it came from the copyrighted input file. It doesn't matter that it looks like, or maybe even is bit-for-bit identical with, some other file that you could get from a random number generator. It happens that you didn't get it from a random number generator. You got it from copyrighted material; it is copyrighted. The randomly-generated file, even if bit-for-bit identical, would have a different Colour. The Colour inherits through all scrambling and descrambling operations and you're distributing a copyrighted work, you Commie Mutant Traitor.

It matters where the bits came from. Even if they're not the same bits as the originals.
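The Monolith scheme described in the quote is easy to sketch (the "copyrighted" input here is a made-up stand-in): XOR a copyrighted file with random bytes. The scrambled result is statistically indistinguishable from random noise, yet XORing again with the same pad reproduces the copyrighted work bit-for-bit - the provenance, not the bits themselves, is what carries the Colour.

```python
import secrets

# A stand-in for a copyrighted file (hypothetical content).
copyrighted = b'#include <stdio.h>\nint main(void){puts("hi");}\n'

# XOR with a random pad: the output looks like pure noise.
pad = secrets.token_bytes(len(copyrighted))
scrambled = bytes(a ^ b for a, b in zip(copyrighted, pad))

# Combining the two "colourless" files reproduces the original exactly.
descrambled = bytes(a ^ b for a, b in zip(scrambled, pad))
assert descrambled == copyrighted
```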
Edited Date: 2021-07-13 02:10 pm (UTC)

Date: 2021-07-17 10:53 am (UTC)
From: (Anonymous)
In regards to Julia's statement that you quoted:

People use other kinds of automatic code generators all the time, so by her statement the output of those code generators can't be copyrighted either. In fact, this applies to lexing, parsing and a lot of other activities as well.

Honestly, it's an easily countered and crappy argument to make.

Date: 2021-07-17 10:55 am (UTC)
From: (Anonymous)
This also calls to mind Google's scanning of copyrighted paper versions of books. Since a machine scans and produces the output, does that mean copyright no longer applies and the latter is considered an original work?

This means anyone can create digital scans of books and freely redistribute them or even sell them.

Paranoia!

Date: 2022-11-13 01:36 am (UTC)
From: (Anonymous)
Really interesting article you linked to. After I read the quote, I had to find out if it was in reference to Paranoia!™

Lo and behold!

Asymmetry

Date: 2021-07-13 02:43 pm (UTC)
From: [personal profile] nolanl

It seems to me that the likely outcome of "copyright doesn't cover ML model output" is that Github (or whoever) will train models on all open source code (of any license) and generated code from that can be used in proprietary software with no source made available. Meanwhile, Github is highly unlikely to train its public model on Github or Microsoft proprietary source code, and no one else can legally access that code to train a model on it.

So the overall ratchet effect is that GPL/MIT/whatever code can now be used in proprietary code bases, without regard to the original license, but proprietary code will largely never be used to train these models, and thus inaccessible to open source developers, or developers at other proprietary software companies.

You could imagine a Github Enterprise product where you can pay for a version of Copilot that is trained on both the open source corpus as well as your company's private proprietary code, accessible only to developers in your company's employ.

Date: 2021-07-14 08:01 am (UTC)
From: [personal profile] ewx

I didn't follow this bit:

The GPL makes use of copyright law to ensure that GPLed work can't be taken from the commons. Anyone who produces a derived work of GPLed code is obliged to provide that work under the same terms. If software weren't copyrightable, the GPL would have no power. But this is the outcome Stallman wanted! The GPL doesn't exist because copyright is good, it exists because software being copyrightable is what enables the concept of proprietary software in the first place.

With copyright law and the GPL, I can GPL my software. If someone else wants to derive from it they have to transmit their (presumably improved) source with binaries, under GPL terms; they cannot (legally) take it from the commons. Anyone can study and derive from the modified work.

Without copyright law, and thus without the GPL, they can legally keep the derived source code secret. They may not be able to impose any legal control over distribution and use of binaries but they have taken the source code from the commons (at least until their next data breach). Nobody[1] else can study or derive from the modified work.

The latter doesn't seem like the target outcome?

[1] OK, people who like spending time with disassemblers a bit can study it, but the situation is a lot worse than having the source.

(No strong opinions about Copilot other than I'm not touching it until the legal situation is a lot clearer.)

The twin sides of copyright

Date: 2021-10-17 05:01 am (UTC)
From: (Anonymous)
This is true, but there's another side to it that this comment doesn't seem to cover.

With copyright law, hypothetical software creator OrangeKey can create their UnBalloon software and sell it. Then as part of that, they can tell you "we will allow you to use this software, but you can't give copies of it to anyone else" (implying "or we will take you to court, and you will lose").

Without copyright law, once you buy a copy, you can just hand out copies to people you know -- or if you're the helpful type, to anyone who asks. This is not as useful to programmers as being able to look at copyleft code, true, but it's about equally useful to most average users.

Of course, things have changed a lot since the idea of copyleft was first conceived, and I'm not sure this side is as major a factor anymore. Even the proprietary stuff is pretty much as available as air nowadays -- it's just festooned with ads and spyware that it won't run without. Either that, or it's "SaaS" and all the useful bits are actually running on someone else's computer (and much of that still has the ads and spyware). The internet as a distribution method makes the "source" part of copyleft and the GPL seem a lot more important these days.

Date: 2021-07-14 10:57 am (UTC)
From: [personal profile] bens_dad
Will this reduce the release of publicly visible but not openly licensed software to GitHub? It is going to make it much easier to reuse relevant code already on GitHub, and gives a defense when the originator complains. As a commercial originator I would be rethinking making my code publicly visible on GitHub.

What about programmers?

Date: 2021-07-15 07:25 am (UTC)
From: [personal profile] jmvalin
How about treating the whole problem like a much more ancient intelligent device called the Programmer? We all get exposed to all kinds of code under all kinds of copyright licenses during our careers. In that context, our experience is pretty similar to a machine learning model's, and it seems like the output should be treated similarly. If a programmer switches jobs and "copies" a significant chunk of code from their previous employer (even from memory), I suspect it would cause copyright concerns. OTOH, if they use general experience acquired from previous jobs, then I don't see how that would be an issue. So programmers already need to think and ensure they don't blindly copy code they've seen before. And it seems like any ML-based code generator should do the same somehow (or risk getting their users into trouble).

Date: 2021-07-17 10:45 am (UTC)
From: (Anonymous)
It all depends on how much of the code is reproduced, and how identical it is to the original.

If you looked at confidential source code, say the files that implement the Windows scheduler, and then just wrote the exact same thing in another file, manually typing it out letter by letter, would the latter be your own original work? What if you just changed the names of variables and classes? Is that also your original work? What do you think copyright law and the courts would say if you were sued?

Microsoft would 100% sue you in that case. So no, they certainly don't get to take significant chunks of GPL and similarly licensed code, copy it as they wish, and spin the result as original work. There is no defense for that.

You assert that this is beneficial for free software, but that's only true in a world without copyright. Until that changes, this is still a violation of the GPL.

My views on the topic

Date: 2022-03-01 10:26 pm (UTC)
From: (Anonymous)

Does free software benefit from ML models being derived works of training data?

Not in the slightest.

Usually the purpose of copyleft is to lead to the creation of more free software. Copyleft says "You want to create a derivative of this program? Go ahead, but it must be free software under the same license!". This can be considered successful when software which would otherwise have been released as proprietary is released as free instead.

Consider, however, the activity of training a neural network: because it requires huge amounts of data, and because any additional selection criterion also adds selection bias, the dataset will include source code licensed under multiple incompatible licenses (as well as unlicensed source code).

If the model is a derived work of the software, this doesn't help the free software community in any way, and copyleft cannot be successful because training the model becomes impossible: nothing is released as free software as a result of the source code being under the GPL.

In essence, such a view of copyright law would severely harm a whole sector (that of machine learning) with no benefit for the free software community. While it would also hurt proprietary software, hurting a whole field of application is generally not good and shouldn't be a strategy for the free software community.

If a situation occurs where an activity being illegal because of copyright would hurt free and proprietary software alike, without really helping anyone, the most reasonable position for a community which stands for freedom is to support the idea that such activity is not (or shouldn't be) illegal.

In addition, while when it comes to software there is an asymmetry between free software and proprietary software (the source code of most proprietary software is unavailable and will not end up in training datasets), this is not so for other kinds of work, to which copyright also applies.

When thinking of which readings of copyright law to support, then, it is imperative to consider those kinds of work as well. If including works in training datasets is allowed, this is good for the field of machine learning. And while it doesn't hurt the authors of the works in the training dataset, allowing such activity is necessarily closer to the intention of free licenses than to that of proprietary ones.

In general, then, a reading of copyright law which allows using copyrighted works in training datasets would benefit the authors of free works, including software, more than those of proprietary works, and it's a kind of freedom the "free world" should support.

Yesterday I commented on this topic on the Libreplanet mailing list. I wish I had cited this blogpost as well: https://lists.libreplanet.org/archive/html/libreplanet-discuss/2022-02/msg00071.html

Profile

Matthew Garrett

About Matthew

Power management, mobile and firmware developer on Linux. Security developer at Aurora. Ex-biologist. [personal profile] mjg59 on Twitter. Content here should not be interpreted as the opinion of my employer. Also on Mastodon.
