How Tech Giants Cut Corners to Harvest Data for A.I.

OpenAI, Google, and Meta have reportedly ignored corporate policies and considered bending copyright laws in their quest for data to train their artificial intelligence systems. OpenAI, for instance, developed a speech recognition tool called Whisper to transcribe the audio from YouTube videos into conversational text, despite internal discussions about whether doing so might violate YouTube's rules. The team, which included OpenAI's president, Greg Brockman, transcribed more than a million hours of YouTube videos, and the resulting text was used to train GPT-4, one of the world's most powerful AI models.

Meanwhile, Meta, owner of Facebook and Instagram, considered purchasing the publishing house Simon & Schuster to obtain long works and discussed gathering copyrighted data from across the internet, even if it risked lawsuits. The company's managers, lawyers, and engineers argued that negotiating licenses with publishers, artists, musicians, and the news industry would be too time-consuming. The actions of these tech companies highlight the desperate race for digital data to advance AI technology.

Key takeaways

OpenAI, Google, and Meta have reportedly ignored corporate policies and discussed skirting copyright law in their quest for online information to train their artificial intelligence systems.
OpenAI developed a tool called Whisper to transcribe YouTube videos into conversational text for training its AI, potentially in violation of YouTube's rules.
OpenAI's team transcribed over a million hours of YouTube videos, which were then used to train a system called GPT-4, one of the world's most powerful AI models.
At Meta, managers, lawyers, and engineers discussed buying Simon & Schuster to procure long works and considered gathering copyrighted data from across the internet, even at the risk of facing lawsuits.
