
Zadie Smith, Stephen King and Rachel Cusk’s pirated works used to train AI

Sep 11, 2023 - theguardian.com
The Atlantic has reported that pirated works by thousands of authors, including Zadie Smith, Stephen King, Rachel Cusk, and Elena Ferrante, have been used to train AI tools. The dataset, known as “Books3”, contains more than 170,000 titles and was used to train models including Meta’s LLaMA, Bloomberg’s BloombergGPT, and EleutherAI’s GPT-J. The titles in Books3 are roughly one-third fiction and two-thirds nonfiction, with the majority published within the last two decades.

This revelation follows a lawsuit filed last month by three writers alleging that their copyrighted works were used to train Meta’s LLaMA. OpenAI, the creator of the AI chatbot ChatGPT, has also been accused of training its model on copyrighted works. Shawn Presser, the independent AI developer who created Books3, expressed sympathy for authors' concerns but defended the dataset as necessary for the development of generative AI tools. Meta declined to comment, while a Bloomberg spokesperson confirmed the company's use of Books3 but said it will not be used for future versions of BloombergGPT.

Key takeaways:

  • Thousands of authors, including Zadie Smith and Stephen King, have had their pirated works used to train AI tools, with over 170,000 titles used by companies such as Meta and Bloomberg.
  • The dataset, known as Books3, was used to train AI models like Meta's LLaMA and Bloomberg's BloombergGPT. It contains a mix of fiction and nonfiction, with the majority of books published in the last two decades.
  • OpenAI, the company behind AI chatbot ChatGPT, has also been accused of training its model on copyrighted works. A lawsuit alleges that the company's training data comes from "shadow libraries" that offer pirated books.
  • Despite the controversy, the creator of Books3, Shawn Presser, defends the dataset, arguing it allows anyone to develop generative AI tools and prevents large companies from monopolizing the technology.
