OpenAI and Anthropic are ignoring an established rule that prevents bots scraping online content

Jun 22, 2024 - businessinsider.com
OpenAI and Anthropic, two leading AI startups, are reportedly ignoring or bypassing robots.txt, the web convention that tells automated crawlers which parts of a site they may not scrape, according to TollBit, a startup that brokers paid licensing deals between publishers and AI companies. Both companies have said publicly that they respect robots.txt and honor blocks aimed at their specific web crawlers, but TollBit's findings suggest those blocks are not being respected. The companies are allegedly choosing to bypass the rule so they can scrape website content as free training data for their models.
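For context, robots.txt is advisory: it only works when a crawler checks the file and chooses to obey it, which is why a block can be bypassed at all. The sketch below shows roughly what that check looks like for a compliant crawler, using Python's standard `urllib.robotparser`. The robots.txt contents, the `example.com` URL, the `is_allowed` helper, and the crawler names `GPTBot` and `ClaudeBot` are illustrative assumptions, not details taken from TollBit's findings.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a publisher might serve to block two AI crawlers
# while leaving the site open to every other user agent.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines; no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/news/story.html"  # placeholder URL
    for agent in ("GPTBot", "ClaudeBot", "SomeOtherBot"):
        print(agent, "allowed" if is_allowed(ROBOTS_TXT, agent, url) else "blocked")
```

Nothing in this check is enforced by the website itself. A crawler that skips the check, or that identifies itself under a different user agent, still receives the page.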

The issue has arisen because of growing demand for high-quality data to build powerful AI models. OpenAI and Anthropic, backed by Microsoft and Amazon respectively, use large amounts of web-scraped text and data to power their chatbots, ChatGPT and Claude. Some tech companies have argued that web content should not be subject to copyright when it is used as AI training data. The US Copyright Office is expected to update its guidance on AI and copyright later this year.

Key takeaways:

  • AI startups OpenAI and Anthropic are reportedly ignoring or circumventing robots.txt, the web rule meant to stop automated scraping of websites, in order to collect free training data for their models.
  • TollBit, a startup that aims to broker paid licensing deals between publishers and AI companies, found that several AI companies are acting in this way and informed certain large publishers.
  • Despite publicly stating that they respect robots.txt and blocks aimed at their specific web crawlers, OpenAI and Anthropic are accused of not respecting such blocks and of choosing to 'bypass' robots.txt to scrape content from websites.
  • OpenAI and Anthropic are behind popular chatbots ChatGPT and Claude respectively, which rely on massive amounts of written text and data scraped from the web, often under copyright or owned by creators.