Feature Story
OpenAI's models 'memorized' copyrighted content, new study suggests | TechCrunch
Apr 04, 2025 · techcrunch.com
OpenAI has defended its practices by advocating for looser restrictions on using copyrighted data for model training, citing fair use. The company has some content licensing agreements and opt-out mechanisms for copyright owners but has also lobbied for clearer fair use rules in AI training. The study's findings highlight the need for greater transparency in AI model training data, as emphasized by co-author Abhilasha Ravichander, who stresses the importance of being able to scientifically probe and audit large language models.
Key takeaways
- A study lends support to allegations, already the subject of lawsuits by rights-holders, that OpenAI trained some of its AI models on copyrighted content without permission.
- The study introduces a method to identify "memorized" training data in models by using "high-surprisal" words.
- In the study's tests, GPT-4 showed signs of having memorized portions of copyrighted books and New York Times articles, raising concerns about training data transparency.
- OpenAI advocates for looser restrictions on using copyrighted data for AI training and has lobbied for clearer "fair use" rules covering it.
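To make the "high-surprisal" idea from the takeaways concrete, here is a minimal sketch, not the study's actual method. It assumes surprisal is measured as -log2 of a word's probability and, for simplicity, estimates that probability from word counts in a small reference text. A real probe would presumably use a language model's context-dependent probabilities. The function names, the `[MASK]` token, and the toy sentences are all hypothetical.

```python
import math
from collections import Counter


def surprisal_bits(word, counts, total, vocab_size):
    """Surprisal in bits, -log2(p), using an add-one smoothed unigram estimate.

    Assumption: a unigram model stands in for a real language model.
    """
    p = (counts[word] + 1) / (total + vocab_size)
    return -math.log2(p)


def mask_highest_surprisal(sentence, counts):
    """Replace the least expected word with [MASK]; return (masked text, hidden word)."""
    words = sentence.split()
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra slot for unseen words
    idx = max(
        range(len(words)),
        key=lambda i: surprisal_bits(words[i].lower(), counts, total, vocab_size),
    )
    masked = words.copy()
    masked[idx] = "[MASK]"
    return " ".join(masked), words[idx]


def looks_memorized(guess, answer):
    """A correct guess of an unlikely word is treated as a hint of memorization."""
    return guess.strip().lower() == answer.strip().lower()


# Toy reference text and probe sentence (both invented for illustration).
reference = Counter("the cat sat on the mat the dog sat on the rug".split())
masked_text, hidden_word = mask_highest_surprisal("the cat sat on the radar", reference)
print(masked_text, "| hidden word:", hidden_word)
```

In this sketch, the masked text would be shown to the model under test, and its guess passed to `looks_memorized`. A single correct guess proves little; any real audit would need many snippets and a comparison against text the model could not have seen.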