These 183,000 Books Are Fueling the Biggest Fight in Publishing and Tech

The article discusses a data set of more than 191,000 books, known as "Books3," that was used without permission to train generative-AI systems by companies including Meta and Bloomberg. The author reports that these books, mostly pirated ebooks published in the last 20 years, are now at the center of several lawsuits filed by authors such as Sarah Silverman, Michael Chabon, and Paul Tremblay, who allege copyright infringement. The author also notes that AI-training practices are secretive and nonconsensual, and that very few people understand how these programs are developed.

The author has created a searchable database to help authors find out whether their works are included in the Books3 data set. The author notes several caveats: the same book may appear multiple times because of different editions or translations, author names are spelled inconsistently, and errors in the book-identification process may produce false positives.

Key takeaways

A data set of more than 191,000 books, known as "Books3," was used without permission to train generative-AI systems by Meta, Bloomberg, and others. This has led to several lawsuits against Meta by authors claiming copyright infringement.
Most authors were unaware that their works were being used in this way, and the people building and training these AI systems stand to profit significantly.
Meta's spokesperson did not directly address the use of pirated books in training its generative-AI product, LLaMA, and instead referred to a court filing arguing that the case should be dismissed because the AI model and its outputs are not "substantially similar" to the authors' books.
The author of the article has created a searchable database that lets authors check whether their works are included in the Books3 data set. Because author names are spelled inconsistently and the book-identification process may contain errors, some search results may be inaccurate.
