3 May 2024

Nvidia and Databricks sued for alleged copyright infringement in AI model development

Nvidia Corporation and Databricks Inc. face class-action lawsuits alleging copyright infringement in the creation of their artificial intelligence models

The litigation highlights a growing concern over the use of copyrighted content without permission.

The lawsuits against Nvidia Corp. and Databricks Inc., filed on March 8 by authors Abdi Nazemian, Brian Keene, and Stewart O’Nan in the U.S. District Court for the Northern District of California, argue that Nvidia’s NeMo Megatron and Databricks’ MosaicML models were trained on vast datasets containing millions of copyrighted works. Notably, the complaints suggest these models include content from well-known authors like Andre Dubus III and Susan Orlean, among others, without their consent. This has sparked a broader debate on whether such practices constitute fair use, as AI developers claim, or whether they infringe the copyrights of individual creators.

The core of the dispute lies in how AI companies compile their training data. Reports indicate that some of the data used included copyrighted material from ‘shadow libraries’ like Bibliotik, which hosts and distributes unlicensed copies of nearly 200,000 books. The involvement of such sources in training datasets could undermine the legality of the AI training process, which relies on the ingestion of large volumes of text to produce sophisticated AI outputs.

Legal experts and industry analysts are closely watching these cases, as the outcomes could set important precedents for the future of AI development. Companies like Nvidia have defended their practices, stating that their development processes comply with copyright laws and emphasizing the transformative nature of AI technology. However, the plaintiffs argue that this does not justify the unauthorized use of their work, which they claim undermines their financial and creative rights.

The lawsuits against Nvidia and Databricks are part of a larger trend of legal challenges facing tech giants over their development of AI technologies and their use of copyrighted materials to train large language models (LLMs), which are designed to process and generate human-like text.

OpenAI, the creator of the AI model known as ChatGPT, faced similar legal scrutiny when the New York Times filed a lawsuit against it, alleging that the company used copyrighted articles to train its language models without permission.

These developments raise crucial questions about the balance between innovation and copyright protection in the digital era.