Zuckerberg approved Meta’s use of ‘pirated’ books to train AI models, authors claim


Mark Zuckerberg approved Meta’s use of “pirated” versions of copyright-protected books to train the company’s artificial intelligence models, a group of authors has alleged in a US court filing.

Citing internal Meta communications, the filing claims that the social network company’s chief executive backed the use of the LibGen dataset, a vast online archive of books, despite warnings within the company’s AI executive team that it is a dataset “we know to be pirated”.

The internal message says that using a database containing pirated material could weaken the Facebook and Instagram owner’s negotiations with regulators, according to the filing. “Media coverage suggesting we have used a dataset we know to be pirated, such as LibGen, may undermine our negotiating position with regulators.”

The US author Ta-Nehisi Coates, the comedian Sarah Silverman and the other authors suing Meta for copyright infringement made the accusations in a filing in a California federal court that was made public on Wednesday.

The authors sued Meta in 2023, arguing that the social media company misused their books to train Llama, the large language model that powers its chatbots.

The Library Genesis, or LibGen, dataset is a “shadow library” that originated in Russia and claims to contain millions of novels, nonfiction books and science magazine articles. Last year a New York federal court ordered LibGen’s anonymous operators to pay a group of publishers $30m (£24m) in damages for copyright infringement.

Use of copyrighted content in training AI models has become a legal battleground in the development of generative AI tools such as the ChatGPT chatbot, with creative professionals and publishers warning that using their work without permission is endangering their livelihoods and business models.

The filing cites a memo noting that “after escalation to MZ”, a reference to Mark Zuckerberg’s initials, Meta’s AI team “has been approved to use LibGen”.

Quoting internal communications, the filing also says Meta engineers discussed accessing and reviewing LibGen data but hesitated on starting that process because “torrenting”, a term for peer-to-peer sharing of files, from “a [Meta-owned] corporate laptop doesn’t feel right”.

A US district judge, Vince Chhabria, last year dismissed claims that text generated by Meta’s AI models infringed the authors’ copyrights and that Meta unlawfully stripped their books’ copyright management information (CMI), which refers to information about the work including the title, name of the author and copyright owner. However, the plaintiffs were given permission to amend their claims.


The writers argued this week that the evidence bolstered their infringement claims and justified reviving their CMI case and adding a new computer fraud allegation.

Chhabria said during a hearing on Thursday that he would allow the writers to file an amended complaint but expressed scepticism about the merits of the fraud and CMI claims.

Meta has been contacted for comment.

Reuters contributed to this article

