The article also examines the tech companies' "fair use" defense: the claim that AI models do not replicate the books but generate new works, and therefore do not harm the commercial market for the originals. The author counters that using copyrighted works without permission is a form of theft, and that exploiting these works for profit is a disturbing trend.
Key takeaways:
- Large language models such as LLaMA and GPT-4 are being trained on copyrighted works from thousands of writers, including Stephen King, Zadie Smith, and Michael Pollan.
- These models are being developed in secret by companies like Meta and OpenAI, with little transparency about the extent of the texts used for training.
- A dataset called "Books3", containing more than 170,000 books, is being used to train these models. The dataset allegedly contains pirated books, raising concerns about copyright violations.
- While some argue that using copyrighted material for AI training constitutes "fair use", the question remains legally unsettled. The situation highlights a clash between the tech industry's open-source culture and the publishing world's reliance on restrictive licenses to protect intellectual property.