Websites use robots.txt to specify which of their pages web crawlers may access, but Anthropic has been accused of ignoring these directives and scraping data regardless. The accusation is notable because the company was founded by former OpenAI researchers with the stated aim of developing "responsible" AI systems. Overly aggressive web crawlers are reportedly a common problem across the AI industry, prompting calls for AI companies to be more respectful in their data-gathering practices.
Key takeaways:
- Anthropic has reportedly been aggressively scraping websites to train its Claude LLM, with or without permission.
- Anthropic's ClaudeBot has been reported to hit individual sites millions of times in a short period, putting significant strain on their resources.
- Although websites use robots.txt to indicate which data crawlers may access, Anthropic reportedly ignores it and takes the data anyway.
- Anthropic was founded by former OpenAI researchers with the promise of developing "responsible" AI systems, but its current practices are being questioned.
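For context on the mechanism at issue: robots.txt is a plain-text file served at a site's root that names crawlers by their user-agent string and lists paths they may or may not fetch. A compliant crawler checks it before every request. The sketch below is illustrative, assuming "ClaudeBot" as the crawler's user-agent string and an example.com file that blocks it site-wide; it is not any specific site's actual configuration.

```python
from urllib import robotparser

# Hypothetical robots.txt: block ClaudeBot everywhere, allow other crawlers.
ROBOTS_TXT = """\
User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A well-behaved crawler consults can_fetch() before requesting each URL.
print(parser.can_fetch("ClaudeBot", "https://example.com/articles/1"))      # False
print(parser.can_fetch("SomeOtherBot", "https://example.com/articles/1"))   # True
```

The key point is that robots.txt is purely advisory: nothing technically prevents a crawler from skipping this check, which is exactly the behavior Anthropic is accused of.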