OpenAI transcribed over a million hours of YouTube videos to train GPT-4

Apr 09, 2024 - theverge.com
The New York Times reports that AI companies, including OpenAI and Google, have resorted to questionable methods to gather high-quality training data for their models. OpenAI reportedly developed its Whisper audio transcription model and used it to transcribe over a million hours of YouTube videos to train GPT-4, a move that was legally dubious but that the company considered fair use. Google also allegedly gathered transcripts from YouTube, which it says was done in accordance with its agreements with YouTube creators. Both companies are said to be exploring the creation of their own synthetic data.

The scramble for data comes after OpenAI reportedly exhausted its supplies of useful data in 2021. By then, it had trained its models on data from GitHub, chess databases, and Quizlet, while Google had asked its privacy team to tweak policy language to expand its use of consumer data. Meta, formerly Facebook, also reportedly discussed using copyrighted works without permission. The Wall Street Journal suggests that AI companies' demand for data may outpace the supply of new content by 2028, and that potential solutions could include training models on synthetic data or "curriculum learning". However, these approaches are unproven, and the alternative of using data without permission has led to multiple lawsuits.

Key takeaways:

  • AI companies, including OpenAI and Google, are struggling to gather high-quality training data for their models, with OpenAI reportedly transcribing over a million hours of YouTube videos for its GPT-4 model.
  • These practices fall into a gray area of AI copyright law: the companies argue that they constitute fair use, while others, including YouTube, argue that they violate terms of service.
  • OpenAI and Google are also exploring other sources of data, including publicly available data, partnerships for non-public data, and generating their own synthetic data.
  • The scarcity of training data is a growing problem in the AI industry, with potential solutions including training models on synthetic data or using 'curriculum learning' to make smarter connections between concepts using less information.