The article also examines the tech companies' "fair use" defense: the claim that AI models do not replicate the books but generate new works, and therefore do not harm the commercial market for the originals. The author counters that using copyrighted works without permission is a form of theft, and that exploiting these works for profit is a disturbing trend.
Key takeaways:
- Large language models such as LLaMA and GPT-4 are being trained on copyrighted works from thousands of writers, including Stephen King, Zadie Smith, and Michael Pollan.
- These models are being developed in secret by companies like Meta and OpenAI, with little transparency about the extent of the texts used for training.
- A dataset called "Books3", containing more than 170,000 books, is being used to train these models. The dataset allegedly contains pirated books, raising concerns about copyright violations.
- While some argue that using copyrighted material for AI training constitutes "fair use", the question remains legally unsettled. The situation highlights a clash between the tech industry's open-source culture and the publishing world's reliance on restrictive licenses to protect intellectual property.