
AI crawlers need to be more respectful

Jul 26, 2024 - news.bensbites.com
The article describes abusive site crawling by AI products and services and the problems it has caused for Read the Docs, a platform that hosts documentation for many projects. The crawlers pull content aggressively with no safeguards against abuse: they download large files hundreds of times a day from many IP addresses, with no rate or bandwidth limiting. This behavior is costing the platform a significant amount of money in bandwidth charges, plus the time spent dealing with the abuse. In one example, a crawler downloaded 73 TB of zipped HTML files in May 2024, costing over $5,000 in bandwidth charges; in another, a crawler using Facebook's content downloader pulled 10 TB of data in June 2024.

In response, Read the Docs has temporarily blocked all traffic from bots identified as AI crawlers, started monitoring bandwidth usage more closely, and begun working on more aggressive rate limiting rules. Even so, the additional bandwidth costs caused by AI crawlers are likely to make the platform run out of its AWS credits early. The platform is asking all AI companies to be more respectful when crawling sites and to implement basic checks in their crawlers. It is open to working with these companies on an arrangement that allows respectful crawling.

Key takeaways:

  • AI crawlers are causing significant problems for Read the Docs, a community-supported site that hosts documentation for many projects, by aggressively pulling content and causing high bandwidth charges.
  • Examples of abuse include one crawler downloading 73 TB of zipped HTML files in May 2024, costing over $5,000 in bandwidth charges, and another using Facebook's content downloader to download 10 TB of data in June 2024.
  • Read the Docs has taken actions to mitigate this abuse, including temporarily blocking all traffic from bots identified as AI Crawlers, monitoring bandwidth usage more closely, and working on more aggressive rate limiting rules.
  • Read the Docs is calling on all AI companies to be more respectful when crawling sites, and suggests the possibility of building an integration that would alert crawlers to content changes so that they download only the files that have changed.
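
The "basic checks" mentioned above are not spelled out in this summary. As a rough illustration only, the sketch below shows two checks a crawler could apply on its own side: a minimum delay between requests to the same host, and conditional request headers so that unchanged files are not downloaded again. The names `HostRateLimiter`, `CachedEntry`, `conditional_headers`, and `should_download` are hypothetical, and the sketch assumes the server returns standard HTTP validators (`ETag` or `Last-Modified`) and answers `304 Not Modified` for unchanged content; it is not Read the Docs' own proposal.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class CachedEntry:
    """Validators remembered from the last successful download of a URL (hypothetical type)."""
    etag: Optional[str] = None
    last_modified: Optional[str] = None


def conditional_headers(entry: Optional[CachedEntry]) -> Dict[str, str]:
    """Build standard HTTP conditional headers from previously seen validators."""
    headers: Dict[str, str] = {}
    if entry is None:
        return headers
    if entry.etag:
        headers["If-None-Match"] = entry.etag
    if entry.last_modified:
        headers["If-Modified-Since"] = entry.last_modified
    return headers


def should_download(status: int) -> bool:
    """Only a 200 carries a new body; 304 means the cached copy is still current."""
    return status == 200


class HostRateLimiter:
    """Enforce a minimum interval between requests to the same host."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._next_allowed: Dict[str, float] = {}

    def wait(self, host: str) -> float:
        """Block until a request to `host` is allowed; return the seconds waited."""
        now = self._clock()
        delay = max(0.0, self._next_allowed.get(host, now) - now)
        if delay > 0:
            self._sleep(delay)
        # The next request to this host must come at least min_interval later.
        self._next_allowed[host] = now + delay + self._min_interval
        return delay
```

In use, a crawler would call `wait(host)` before each request, send the headers from `conditional_headers(...)` with it, and store the response body only when `should_download(status)` is true. The interval is left as a parameter because an appropriate value depends on the site being crawled.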