
Twice as many companies block OpenAI's GPTBot, other AI web crawlers

Sep 28, 2023 - businessinsider.com
Hundreds of major companies and websites are now blocking the web crawlers that collect AI training data, including OpenAI's GPTBot and Common Crawl's CCBot. Over the course of three weeks, the number of top sites blocking GPTBot has jumped to more than 250, including Amazon, Tumblr, Pinterest, Vimeo, GrubHub, and several news outlets. Unique and accurate information, mainly scraped from the web, is vital to the performance of AI models, but the practice of scraping has led to several lawsuits and potential new government regulations.

Common Crawl's web crawler, CCBot, is also being blocked by many companies. As of late September, almost 14% of the 1,000 most popular websites are blocking CCBot. Those blocking CCBot include Amazon, Vimeo, Masterclass, The New York Times, and The New Yorker. Many of those blocking CCBot also block GPTBot. Despite this, tech companies have updated their terms of service and user policies to give them free and full access to user content and activity for use in AI projects and training.

Key takeaways:

  • Hundreds of major companies and websites are blocking the web crawlers run by OpenAI and Common Crawl, both of which gather data that is widely used to train AI models.
  • Over three weeks, the number of top sites blocking GPTBot has jumped to more than 250, including Pinterest, Vimeo, GrubHub, Indeed, Apartments.com, The Guardian, Live Science, USA Today, NPR, CBS News and CBS Sports, NBC News and CNBC, The New Yorker, People, and all titles published by Hearst and Condé Nast.
  • Unique and accurate information is vital to the performance of generative AI models like OpenAI's GPT-4, which has effectively memorized huge amounts of text to respond cleverly to user questions. Most of the information these models are trained on is pulled from the internet, even though most of it is owned by someone or protected by copyright.
  • As of late September, almost 14% of the 1,000 most popular websites are blocking CCBot, the web crawler used by Common Crawl. Those blocking CCBot include Amazon, Vimeo, Masterclass, Kelley Blue Book, The New York Times, The New Yorker, and The Atlantic.
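
The article does not say how sites block these crawlers. In practice this is usually done through a site's robots.txt file, which names a crawler's user-agent token and disallows it. The sketch below, in Python, assumes that mechanism and assumes the tokens are `GPTBot` and `CCBot`. The robots.txt content and the `blocked_agents` helper are hypothetical, written only for illustration, and are not taken from any site named above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption for illustration,
# not copied from any site named in the article).
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def blocked_agents(robots_txt, agents=("GPTBot", "CCBot"), path="/"):
    """Return the user agents that the given robots.txt text bars from `path`."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is needed.
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, path)]


print(blocked_agents(SAMPLE_ROBOTS_TXT))  # prints ['GPTBot', 'CCBot']
```

A survey like the one the article describes could, under the same assumptions, download each site's robots.txt and run a check of this kind. Note that robots.txt is a request that well-behaved crawlers honor, not a technical barrier.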