Meta came under scrutiny earlier this year after unredacted court documents suggested that the tech giant trained its artificial intelligence models using Library Genesis (LibGen), a well-known repository of pirated books. The revelations emerged as part of a lawsuit filed by a group of authors accusing Meta of copyright infringement.
As reported by Wired in January, the case, Kadrey et al. v. Meta Platforms, is among the earliest legal battles challenging AI training practices. Its outcome, along with other similar lawsuits, could significantly impact how AI companies use copyrighted content to train their models and whether such practices fall under the doctrine of “fair use.”
Meta’s Approach To Concealing Info Criticised
A major development in the lawsuit came when Judge Vince Chhabria of the US District Court for the Northern District of California ordered the release of previously redacted court documents. Chhabria criticised Meta’s approach to concealing information, calling it “preposterous” and asserting that the company was attempting to “avoid negative publicity” rather than protect trade secrets.
Among the most notable disclosures was an internal message from a Meta employee who reportedly expressed concern about potential backlash. “If there is media coverage suggesting we have used a dataset we know to be pirated, such as LibGen, this may undermine our negotiating position with regulators on these issues,” the employee noted. Meta has declined to comment on the matter.
Sued By Novelists
The lawsuit was originally filed in July 2023 by novelists Richard Kadrey and Christopher Golden, alongside comedian Sarah Silverman. They allege that Meta used their copyrighted works without permission to train its AI models. Meta has maintained that its use of publicly available materials is protected under the fair use doctrine, arguing that analysing text to model language does not constitute direct copyright infringement.
Before these documents surfaced, Meta had acknowledged training its Llama language model using Books3, a dataset containing nearly 200,000 books. However, the company had not previously disclosed any use of data sourced directly from LibGen. The newly revealed documents suggest that Meta’s AI team knew the dataset contained pirated material, with one engineer remarking that “torrenting from a [Meta-owned] corporate laptop doesn’t feel right.”
Zuck Said To Be Aware Of Dataset
Further claims in the lawsuit allege that Meta’s leadership, including CEO Mark Zuckerberg, was aware of the dataset’s origins. Internal communications reportedly referred to him as “MZ” when discussing decisions related to using LibGen data. Plaintiffs argue that these exchanges demonstrate Meta’s knowledge of the dataset’s questionable legality.
Meta has pushed back against the plaintiffs’ attempts to amend their lawsuit, calling it “an eleventh-hour gambit based on a false and inflammatory premise.” The company contends that it disclosed its use of LibGen in July 2024 and that the plaintiffs had ample opportunity to adjust their claims before discovery ended in December.
Legal experts are closely watching the case, as its resolution could establish important precedents for AI training. While some claims in the lawsuit—such as violations of the Digital Millennium Copyright Act—were dismissed in 2023 due to insufficient evidence, plaintiffs argue that the newly revealed documents provide grounds to revisit those allegations. They also claim that Meta went beyond using pirated content for training by actively distributing it, a process known as “seeding” in torrent networks.
LibGen, which originated in Russia in 2008, remains one of the largest shadow libraries globally. US courts have repeatedly tried to shut it down; in 2024, a New York judge ordered the platform’s operators to pay $30 million in damages. Despite these legal challenges, the site continues to operate through alternative domains.
As the lawsuit progresses, Judge Chhabria has warned Meta against further attempts to broadly redact court filings, cautioning that any future overreach could lead to the unsealing of all related materials. “If Meta again submits an unreasonably broad sealing request, all materials will simply be unsealed,” he stated.
The case could have far-reaching implications for AI development, copyright law, and the way tech companies navigate intellectual property concerns in the age of generative AI.