Sites scramble to block ChatGPT web crawler after instructions emerge

OpenAI has recently added details about its web crawler, GPTBot, to its online documentation. GPTBot retrieves webpages that are used to train AI models like ChatGPT and GPT-4. The company has implemented filters so that GPTBot does not access sources that sit behind paywalls, collect personal information, or violate OpenAI's policies. However, the new instructions may not prevent web-browsing versions of ChatGPT from accessing current websites. OpenAI also provides instructions on how to block GPTBot from crawling a website using the industry-standard robots.txt file.
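As a rough illustration of the robots.txt approach, the sketch below uses standard robots.txt syntax to disallow the `GPTBot` user agent site-wide, then checks the rule with Python's built-in `urllib.robotparser`. The user-agent token comes from the article; the `is_allowed` helper, the example.com URL, and the `SomeOtherBot` name are hypothetical and exist only for this example. OpenAI's exact published instructions are not reproduced in this summary, so treat the rule as an assumption about what a full block would look like.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content: a site-wide block for the GPTBot user agent,
# written in standard robots.txt syntax.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    Hypothetical helper for this example; it parses the rules in memory
    instead of downloading them from a live site.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/articles/some-story"  # placeholder URL
    print("GPTBot allowed:", is_allowed(ROBOTS_TXT, "GPTBot", page))
    print("SomeOtherBot allowed:", is_allowed(ROBOTS_TXT, "SomeOtherBot", page))
```

Because the rule names only `GPTBot`, crawlers with other user agents are unaffected. Note also that robots.txt is a convention a crawler chooses to honor, not an enforcement mechanism.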

Blocking GPTBot does not guarantee that a site's data will stay out of future AI models. Other large data sets of scraped websites, not affiliated with OpenAI, are used to train open-source language models. Some sites have reacted with enthusiasm to the news that they can block their content from future GPT models, but for large website operators, the choice to block language model crawlers isn't as straightforward. Blocking content from future AI models could decrease a site's or a brand's cultural footprint if AI chatbots become a primary user interface in the future.

Key takeaways

OpenAI has added details about its web crawler, GPTBot, to its online documentation site. This bot is used to retrieve webpages to train AI models like ChatGPT and GPT-4.
OpenAI has implemented filters to prevent GPTBot from accessing sources that are behind paywalls, collect personally identifiable information, or violate OpenAI's policies.
Website administrators can block GPTBot from crawling their websites using the industry-standard robots.txt file. OpenAI has provided instructions on how to do this.
Blocking GPTBot does not guarantee that a site's data will stay out of future AI models, because other large data sets of scraped websites exist that are not affiliated with OpenAI.
