Meta Platforms Inc., the parent company of Facebook and Instagram, is facing mounting legal troubles as it battles allegations of copyright infringement. Lawyers for the tech giant had reportedly cautioned against using pirated books for training its AI models, but according to a recent court filing, Meta proceeded with the practice.
The new complaint, filed late on Monday night, consolidates two separate lawsuits brought against Meta by notable figures, including comedian Sarah Silverman and Pulitzer Prize winner Michael Chabon. These authors claim that Meta utilised their copyrighted works without permission to train its artificial-intelligence language model, Llama.
A California judge partially dismissed the Silverman lawsuit last month but indicated that the authors could amend their claims. Meta has not yet issued a statement in response to these allegations.

The latest complaint includes chat logs revealing discussions within Meta about obtaining a dataset containing thousands of pirated books. This evidence suggests that Meta was aware that its use of these books might not be protected by US copyright law.
In the quoted chat logs, Meta-affiliated researcher Tim Dettmers discussed the dataset procurement in a Discord server. He described his conversations with Meta’s legal department regarding the legality of using the books as training data.
“At Facebook, there are a lot of people interested in working with (T)he (P)ile, including myself, but in its current form, we are unable to use it for legal reasons,” Dettmers wrote in 2021, according to Reuters. He was referring to a dataset that, according to the complaint, Meta has acknowledged using to train the initial version of Llama.
The month before, Dettmers noted that Meta’s lawyers had informed him that “the data cannot be used or models cannot be published if they are trained on that data,” the complaint revealed.
While Dettmers did not elaborate on the lawyers’ concerns, others in the chat identified “books with active copyrights” as the primary potential issue. They argued that training on the data should qualify as “fair use,” a US legal doctrine that permits certain unlicensed uses of copyrighted works.
Tech companies have been grappling with numerous lawsuits this year from content creators who accuse them of using copyright-protected materials to develop generative AI models, which have gained global popularity and attracted significant investment.
If these cases succeed, they could reshape the generative AI landscape by compelling AI firms to compensate artists, authors, and content creators for the use of their intellectual property, raising the cost of building data-intensive models.
Additionally, new provisional rules in Europe regulating artificial intelligence could force companies to disclose the data used to train their models, potentially exposing them to more legal risks.
Meta released the initial version of its Llama large language model in February, disclosing a list of datasets used for training, including “the Books3 section of ThePile.” This dataset, as noted in the complaint, contains a staggering 196,640 books. The company, however, did not reveal the training data for its latest model, Llama 2, which became commercially available this summer.
Llama 2 is offered free of charge to companies with fewer than 700 million monthly active users. Its release has been seen as a potential disruptor in the generative AI software market, posing a challenge to established players like OpenAI and Google that charge for the use of their models.
[…] week, chipmaker Nvidia joined the ranks of OpenAI, Meta, and Microsoft after three authors sued it for alleged copyright infringement. While it may not […]