OpenAI's models 'memorized' copyrighted content, new study suggests | TechCrunch
Apr 04, 2025 - techcrunch.com
A new study suggests that OpenAI may have trained some of its AI models on copyrighted content without permission, supporting ongoing legal claims by authors and rights-holders. The study, co-authored by researchers from the University of Washington, the University of Copenhagen, and Stanford, introduces a method to identify training data "memorized" by models like GPT-4 and GPT-3.5. By removing "high-surprisal" words from texts and having the models guess them, the researchers found evidence that these models memorized parts of copyrighted fiction books and New York Times articles. This raises concerns about data transparency and the need for tools to audit AI models.
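The probe described above can be sketched in a few lines. This is a minimal illustration, not the study's actual implementation: a simple unigram model stands in for the study's model-based surprisal estimates, and the hypothetical `guess_word` callable stands in for querying a model like GPT-4 with the masked passage.

```python
import math
import re
from collections import Counter

def surprisal_scores(words, counts, total):
    # Surprisal -log p(w) under a smoothed unigram model:
    # rarer words carry higher surprisal. (A stand-in for the
    # study's model-based surprisal estimates.)
    vocab = len(counts)
    return [-math.log((counts[w] + 1) / (total + vocab)) for w in words]

def memorization_probe(text, guess_word, top_k=3):
    """Mask the top_k highest-surprisal words in `text`, ask the
    model (via the caller-supplied `guess_word`) to fill each blank,
    and return the fraction of blanks it recovers exactly. A high
    recovery rate on unusual words is taken as evidence the passage
    was memorized during training."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    counts = Counter(words)
    scores = surprisal_scores(words, counts, len(words))
    # Indices of the top_k most surprising (least predictable) words.
    targets = sorted(range(len(words)),
                     key=lambda i: scores[i], reverse=True)[:top_k]
    hits = 0
    for i in targets:
        masked = words[:i] + ["[MASK]"] + words[i + 1:]
        if guess_word(" ".join(masked)) == words[i]:
            hits += 1
    return hits / max(len(targets), 1)
```

A model that reliably fills in rare, hard-to-predict words from a specific book or article is unlikely to be guessing from general language statistics alone, which is the intuition behind the researchers' test.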
OpenAI has defended its practices by advocating for looser restrictions on using copyrighted data for model training, citing fair use. The company has some content licensing agreements and opt-out mechanisms for copyright owners but has also lobbied for clearer fair use rules in AI training. The study's findings highlight the need for greater transparency in AI model training data, as emphasized by co-author Abhilasha Ravichander, who stresses the importance of being able to scientifically probe and audit large language models.
Key takeaways:
A study suggests OpenAI trained some AI models on copyrighted content without permission, lending support to ongoing legal claims by authors and rights-holders.
The study introduces a method to identify "memorized" training data in models by using "high-surprisal" words.
Tests showed GPT-4 memorized portions of copyrighted books and New York Times articles, raising concerns about data transparency.
OpenAI advocates for looser restrictions on using copyrighted data for AI training and has lobbied for clearer fair-use rules.