Blocking GPTBot does not guarantee that a site's data will stay out of future AI training sets. Other large data sets of scraped websites, not affiliated with OpenAI, are used to train open-source language models. Some site operators have greeted the ability to keep their content out of future GPT models with enthusiasm, but for large website operators the choice to block language model crawlers isn't as straightforward. Blocking content from future AI models could shrink a site's or a brand's cultural footprint if AI chatbots become a primary user interface.
Key takeaways:
- OpenAI has added details about its web crawler, GPTBot, to its online documentation site. This bot is used to retrieve webpages to train AI models like ChatGPT and GPT-4.
- OpenAI has implemented filters to keep GPTBot away from sources behind paywalls, sources that collect personally identifiable information, and any content that violates OpenAI's policies.
- Website administrators can block GPTBot from crawling their websites using the industry-standard robots.txt file. OpenAI has provided instructions on how to do this.
- Blocking GPTBot does not guarantee that a site's data will stay out of future AI models, because other large data sets of scraped websites, not affiliated with OpenAI, are also used for training.
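
For administrators who want to try the robots.txt approach mentioned above, the following is a minimal sketch rather than an official recipe. It assumes a site-wide block using the `GPTBot` user-agent token and checks the rule locally with Python's standard-library `urllib.robotparser`. The `example.com` URL and the `is_allowed` helper are hypothetical, introduced only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt contents for a site-wide block of GPTBot.
# Crawlers not named here remain unaffected by this rule.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    Hypothetical helper: parses the rules from a string so the check
    can run without any network access.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/articles/1"  # placeholder URL
    print("GPTBot allowed:", is_allowed(ROBOTS_TXT, "GPTBot", page))
    print("Googlebot allowed:", is_allowed(ROBOTS_TXT, "Googlebot", page))
```

Note that robots.txt is a request that well-behaved crawlers honor, not an access control. A check like this only confirms that the rules say what the administrator intended.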