
AI's Desperate Hunger For News Training Data Has Publishers Fighting Back. Here’s How.

Jun 04, 2024 - news.bensbites.com
The article discusses the escalating battle between AI companies and news organizations over the scraping of news content for AI training data. Companies like Google, Meta, Anthropic, and OpenAI are developing advanced AI models, but their data-gathering tactics, including mass scraping of news articles, have led to backlash from the journalism world. News organizations such as Graham Media Group, The New York Times, The Guardian, Hearst, and Hubbard Broadcasting have blocked AI chatbots from scraping their sites, citing threats to their business models and to the integrity of journalism.

In response, newsrooms are updating their terms of service to ban AI scraping, blocking AI data-scraping bots, licensing their content to AI companies as training data, and building their own large language models (LLMs). Some organizations have also filed lawsuits against OpenAI and Google, alleging that the companies illegally harvested data. As more publishers put up barriers to web scraping, AI companies are exploring alternatives such as synthetic data. The article suggests that collaboration, through proactive licensing of content as training data, could be a sustainable way for AI and journalism to coexist.

Key takeaways:

  • News organizations are fighting back against AI companies scraping their content without permission, with some calling it the “largest theft in the United States.”
  • Newsrooms are taking steps to protect their content, including updating terms of service to ban AI scraping, blocking AI data scraping bots, licensing training content to AI companies, and creating their own LLMs.
  • Some news organizations have filed lawsuits against AI companies like OpenAI and Google, accusing them of illegally harvesting “massive amounts of personal data” to train their AI chatbots.
  • As more publishers put up barriers to web scraping, AI companies are exploring alternative paths, such as using synthetic data or collaborating with news organizations to license access to their content as training data.