Common Crawl's web crawler, CCBot, is also being blocked by many companies. As of late September, almost 14% of the 1,000 most popular websites are blocking CCBot. Those blocking CCBot include Amazon, Vimeo, Masterclass, The New York Times, and The New Yorker. Many of those blocking CCBot also block GPTBot. Despite this, tech companies have updated their terms of service and user policies to give themselves free and full access to user content and activity for use in AI projects and training.
Key takeaways:
- Hundreds of major companies and websites are blocking the web crawlers operated by OpenAI, the maker of ChatGPT, and by Common Crawl, both of which are major sources of AI training data.
- Over three weeks, the number of top sites blocking GPTBot has jumped to more than 250, including Pinterest, Vimeo, GrubHub, Indeed, Apartments.com, The Guardian, Live Science, USA Today, NPR, CBS News and CBS Sports, NBC News and CNBC, The New Yorker, People, and all titles published by Hearst and Condé Nast.
- Unique and accurate information is vital to the performance of generative AI models like OpenAI's GPT-4, which has effectively memorized huge amounts of text to respond cleverly to user questions. Most of the information these models are trained on is pulled from the internet, even though most of it is owned by someone or protected by copyright.
- As of late September, almost 14% of the 1,000 most popular websites are blocking CCBot, a web crawler used by Common Crawl. Those blocking CCBot include Amazon, Vimeo, Masterclass, Kelley Blue Book, The New York Times, The New Yorker, and The Atlantic.