The issue stems from growing demand for high-quality data to build powerful AI models. OpenAI and Anthropic, backed by Microsoft and Amazon respectively, use large amounts of web-scraped text and data to power their chatbots, ChatGPT and Claude. Some tech companies have argued that web content should not be treated as copyright-protected when it is used as AI training data. The US Copyright Office is expected to update its guidance on AI and copyright later this year.
Key takeaways:
- AI startups OpenAI and Anthropic are reportedly ignoring or circumventing robots.txt, the web convention that tells automated crawlers which parts of a site they should not access, in order to scrape website content as free model training data.
- TollBit, a startup that aims to broker paid licensing deals between publishers and AI companies, found that several AI companies are acting in this way and informed certain large publishers.
- Both companies have said publicly that they respect robots.txt and any blocks aimed at their specific web crawlers, yet they are accused of ignoring those blocks and choosing to 'bypass' robots.txt to scrape content from websites.
- OpenAI and Anthropic are behind the popular chatbots ChatGPT and Claude respectively, which rely on massive amounts of written text and data scraped from the web, much of it under copyright or owned by its creators.
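
To make the robots.txt mechanism concrete, here is a minimal sketch of the check a well-behaved crawler would run before fetching a page, using Python's standard `urllib.robotparser`. The rules, the `example.com` URL, and the helper `may_fetch` are hypothetical, and the crawler names `GPTBot` and `ClaudeBot` are assumptions used for illustration; none of them come from the reporting above. The file itself cannot stop a request, so the check only matters if the crawler chooses to run it and honor the result.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a publisher might serve. The crawler names are
# assumptions for illustration and are not taken from the article.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given rules allow `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/articles/1"  # hypothetical URL
    for agent in ("GPTBot", "ClaudeBot", "SomeOtherBot"):
        print(agent, may_fetch(agent, page))
```

A crawler that skips this check, or that identifies itself under a different user-agent string, is not affected by the rules at all, which is the kind of behavior the accusations describe.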