My views on the topic

Does free software benefit from ML models being derived works of training data?

Not in the slightest.

Usually the purpose of copyleft is to lead to the creation of more free software. Copyleft says: "You want to create a derivative of this program? Go ahead, but it must be free software under the same license!" It can be considered successful when software that would otherwise have been released as proprietary is released as free software instead.

Consider, however, the activity of training a neural network: because it requires a huge amount of data, and because any additional selection criterion also adds selection bias, the dataset will include source code licensed under multiple incompatible licenses (as well as unlicensed source code).

If the model is a derived work of the software it was trained on, this doesn't help the free software community in any way. Copyleft cannot be successful here, because training the model becomes impossible: a model cannot comply with several incompatible licenses at once, so nothing is released as free software as a result of the GPL-licensed source code in the dataset.

In essence, such a view of copyright law would severely harm a whole sector (that of machine learning) with no benefit for the free software community. While it would also hurt proprietary software, hurting a whole field of application is generally not good and shouldn't be a strategy for the free software community.

If a situation occurs where an activity being illegal because of copyright would hurt free and proprietary software alike, without really helping anyone, the most reasonable position for a community which stands for freedom is to support the idea that such activity is not (or shouldn't be) illegal.

In addition, when it comes to software there is an asymmetry between free and proprietary software: the source code of most proprietary software is unavailable and will not end up in training datasets. This is not so for other kinds of work, to which copyright also applies.

When thinking of what readings of copyright law to support, then, it is imperative to consider those kinds of work as well. If including works in training datasets is allowed, this is good for the field of machine learning. And while it doesn't hurt the authors of the works in the training dataset, allowing such activity is necessarily closer to the intention of free licenses than to that of proprietary ones.

In general, then, a reading of copyright law which allows copyrighted works to be used in training datasets would benefit the authors of free works, including software, more than those of proprietary works, and it's a kind of freedom the "free world" should support.

Yesterday I commented on this topic on the LibrePlanet mailing list. I wish I had cited this blog post as well: https://lists.libreplanet.org/archive/html/libreplanet-discuss/2022-02/msg00071.html
