22 Aug 2023

A library of AI training with copyrighted books

An inquiry by The Atlantic has found that well-known generative AI models, such as Meta’s open source Llama, were trained in part on illicitly obtained copies of books by acclaimed authors.

An investigation by The Atlantic has revealed that popular generative AI models, including Meta’s open source Llama, were partially trained using pirated versions of books by leading authors. Other affected models include BloombergGPT and GPT-J, the latter from the nonprofit EleutherAI. The pirated books, approximately 170,000 titles published within the past 20 years, were part of a larger dataset called the Pile, which was freely available online until recently. Among the authors whose works were copied without permission are renowned names like Stephen King, Margaret Atwood, Haruki Murakami, and Jonathan Franzen. Notably, Sarah Silverman and two other authors have already filed lawsuits against Meta and OpenAI for copyright infringement.

The person who released the dataset said the aim was to provide “OpenAI-grade training data” to others. While some developers may argue fair use, others may have been unaware they were using copyrighted material. The legal questions surrounding the use of copyrighted data to train AI models remain unresolved. Meanwhile, EleutherAI is working on a version of the Pile that contains only documents licensed for such use.
