A poster’s guide to who’s selling your data to train AI

The article discusses how internet users' data is being scraped and used to train AI systems like ChatGPT, Midjourney, and Sora, often without users' consent. This has led to lawsuits: The New York Times is suing OpenAI for allegedly using its archives without permission, and Getty Images is suing Stability AI, the maker of Stable Diffusion, for copyright infringement. Other companies have instead chosen to license their archives; the Associated Press and Shutterstock, for example, have deals with OpenAI. The article warns that anyone who posts online could have their content sold to AI companies by the platforms that host it.

The article also highlights specific instances of data selling. Automattic, the parent company of Tumblr and WordPress, is reportedly preparing to announce deals selling user data to OpenAI and Midjourney. Reddit has already sold access to its posts to Google in a $60 million deal. The article concludes that large AI models are likely being trained on posts from across the internet, including public posts on platforms like Facebook and Instagram.

Key takeaways

Internet users' data is often scraped and used to train AI systems, sometimes without the permission of the content creators.
Tumblr and WordPress are reportedly preparing to sell user data to the AI companies OpenAI and Midjourney.
Automattic, the parent company of Tumblr and WordPress, has announced a way for users to opt out of sharing their public content with third parties.
Reddit has sold Google access to its posts in a $60 million deal, for use in training Google's generative AI models.
