Website owners who do not want their content used for AI training can ask Google's and OpenAI's bots to skip their site. Opting out only applies to future scraping; it does not affect data already collected, data posted elsewhere, or scraping by other companies building their own LLMs. To block these crawlers, website owners need to edit or create a "robots.txt" file on their site, the standard file that gives instructions to bots and web crawlers. The article provides detailed instructions on how to do this.
Key takeaways:
- OpenAI and Google have released guidelines for website owners who do not want their site content used to train the companies' large language models (LLMs).
- Website owners can ask the bots deployed by Google and OpenAI to skip over their site by editing or creating a "robots.txt" file on their website.
- However, this only applies to future scraping and does not affect data already collected or data scraped by other companies.
- While there is no technical requirement for a bot to obey these requests, many crawling services respect them, and opting out is one step website owners can take if they are uncomfortable with their content being used in AI training sets.
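As a sketch of what the opt-out looks like, a `robots.txt` file at the site root could include directives like the following. `GPTBot` is the user-agent token OpenAI has documented for its training crawler, and `Google-Extended` is the token Google introduced to control use of content for its AI models; both names reflect the companies' public documentation at the time of writing and are worth verifying against their current guidelines:

```
# Ask OpenAI's training crawler to skip the entire site
User-agent: GPTBot
Disallow: /

# Ask Google not to use the site's content for its AI models
User-agent: Google-Extended
Disallow: /
```

To block crawling of only part of a site, the `Disallow` path can be narrowed (for example, `Disallow: /private/`). Note that `Google-Extended` is a control token rather than a separate crawler: it does not change how Googlebot indexes the site for Search, only whether the content is used for AI training.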