The scramble for data comes as supplies of useful training data reportedly ran short around 2021. OpenAI had previously trained its models on data from GitHub, chess databases, and Quizlet, while Google had asked its privacy team to tweak policy language to expand its use of consumer data. Meta, formerly Facebook, also reportedly discussed using copyrighted works without permission. The Wall Street Journal suggests that AI companies' demand for data may outpace the supply of new content by 2028, and that potential solutions could include training models on synthetic data or "curriculum learning". However, these approaches are unproven, and the alternative of using data without permission has already led to multiple lawsuits.
Key takeaways:
- AI companies, including OpenAI and Google, are struggling to gather high-quality training data for their models, with OpenAI reportedly transcribing over a million hours of YouTube videos for its GPT-4 model.
- These practices fall into a gray area of AI copyright law: the companies argue they constitute fair use, while others, including YouTube, say they violate its terms of service.
- OpenAI and Google are also exploring other sources of data, including publicly available data, partnerships for non-public data, and generating their own synthetic data.
- The scarcity of training data is a growing problem in the AI industry, with potential solutions including training models on synthetic data or using "curriculum learning" to make smarter connections between concepts using less information.