
    Meta’s AI Training Scandal: How 82TB of Pirated Books Sparked a Legal Firestorm

    Court documents reveal Meta’s use of shadow libraries to train AI, raising ethical and legal concerns.

    • Meta allegedly downloaded 81.7TB of pirated books from shadow libraries like Z-Library and LibGen to train its AI models, according to court records.
    • Internal communications reveal ethical concerns among Meta employees, with some calling the use of pirated material a violation of copyright laws.
    • The case highlights broader issues in the AI industry, as companies like OpenAI and Nvidia also face lawsuits over copyright infringement in AI training.

    Meta, the parent company of Facebook, is embroiled in a legal battle that has exposed its controversial methods for training artificial intelligence. Court documents from the ongoing class-action lawsuit, Kadrey v. Meta, reveal that the company allegedly downloaded 81.7TB of pirated books and other materials from shadow libraries such as Anna’s Archive, Z-Library, and LibGen. These materials were reportedly used to train Meta’s large language model, LLaMA, sparking allegations of copyright infringement and unfair competition.

    The revelations, first highlighted in a post by vx-underground on X (formerly Twitter), have sent shockwaves through the tech and publishing industries. The case not only raises questions about Meta’s ethical practices but also sheds light on a broader issue: the murky legal and moral waters surrounding AI training data.

    Ethical Concerns from Within

    Internal communications among Meta employees, unsealed in court documents, paint a troubling picture of the company’s approach to AI training. As early as October 2022, some employees expressed serious ethical concerns about using pirated materials. One senior AI researcher stated, “I don’t think we should use pirated material. I really need to draw a line here.” Another employee echoed this sentiment, saying, “Using pirated material should be beyond our ethical threshold,” and likened shadow libraries like LibGen and SciHub to notorious piracy platforms such as The Pirate Bay.

    Despite these concerns, the company appeared to press forward. In January 2023, Meta CEO Mark Zuckerberg attended a meeting where he reportedly said, “We need to move this stuff forward… we need to find a way to unblock all this.” By April 2023, employees were discussing ways to conceal Meta’s involvement in torrenting activities, including using VPNs to mask corporate IP addresses. One employee even joked, “Torrenting from a corporate laptop doesn’t feel right,” followed by a laughing emoji.

    These communications suggest that while some employees were uneasy about the ethical implications, the company pressed ahead, allegedly taking deliberate steps to circumvent copyright law.

    A Broader Industry Problem

    Meta is not the only tech giant facing scrutiny over its AI training practices. The lawsuit against Meta is part of a growing trend of legal challenges targeting AI companies for their use of copyrighted material.

    In June 2023, OpenAI was sued by a group of novelists who alleged that their books were used without permission to train ChatGPT. Later that year, The New York Times filed a similar lawsuit, accusing OpenAI of scraping its content without authorization. Nvidia, another major player in the AI space, faced a lawsuit over its alleged use of nearly 200,000 books to train its NeMo model, and a whistleblower claimed the company had also been scraping more than 426,000 hours of video content per day for AI training.

    These cases highlight a systemic issue: the lack of clear legal frameworks governing the use of copyrighted material in AI training. While companies argue that scraping publicly available data falls under “fair use,” critics contend that this practice undermines intellectual property rights and devalues creative work.

    The Road Ahead

    The case against Meta is still ongoing, and the outcome remains uncertain. If the court rules in favor of the plaintiffs, it could set a significant precedent for how AI companies source their training data. However, given Meta’s vast financial resources, the company is likely to appeal any unfavorable decision, potentially dragging the case out for years.

    Regardless of the legal outcome, the revelations about Meta’s practices have already sparked a broader conversation about ethics in AI development. As AI continues to play an increasingly central role in society, companies will face growing pressure to ensure that their models are trained responsibly and transparently.

    The Meta case also underscores the need for stronger regulations to govern the use of copyrighted material in AI training. Without clear guidelines, the industry risks perpetuating a cycle of legal disputes and ethical controversies.

    The allegations against Meta serve as a stark reminder of the ethical and legal challenges facing the AI industry. While the promise of AI is immense, its development must not come at the expense of intellectual property rights or ethical standards. As the case unfolds, it will not only shape the future of Meta but also set the tone for how the tech industry navigates the complex intersection of AI, copyright, and ethics.

    The world will be watching closely, as the outcome of this case could redefine the boundaries of what is acceptable in the race to build ever-more powerful AI systems.
