Website owners who do not want their content used for AI training can ask Google's and OpenAI's bots to skip their site. Opting out only applies to future scraping; it does not affect data already collected, data posted elsewhere, or scraping by other companies building their own LLMs. To block these crawlers, website owners need to edit or create a "robots.txt" file on their site, the standard file that gives instructions to bots and web crawlers. The article provides detailed instructions on how to do this.
Key takeaways:
- OpenAI and Google have released guidelines for website owners who do not want their site content used to train the companies' large language models (LLMs).
- Website owners can ask the bots deployed by Google and OpenAI to skip over their site by editing or creating a "robots.txt" file on their website.
- However, this only applies to future scraping and does not affect data already collected or data scraped by other companies.
- While there is no technical requirement for a bot to obey these requests, many crawling services respect them, and opting out is one step website owners can take if they are uncomfortable with their content being used in AI training sets.
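As a sketch of what the opt-out looks like, a `robots.txt` file at the site root could include directives like the following. `GPTBot` is the user-agent token OpenAI has documented for its training crawler, and `Google-Extended` is the token Google introduced to control use of content for its AI models; both names reflect the companies' public documentation at the time of writing and are worth verifying against their current guidelines:

```
# Ask OpenAI's training crawler to skip the entire site
User-agent: GPTBot
Disallow: /

# Ask Google not to use the site's content for its AI models
User-agent: Google-Extended
Disallow: /
```

To block crawling of only part of a site, the `Disallow` path can be narrowed (for example, `Disallow: /private/`). Note that `Google-Extended` is a control token rather than a separate crawler: it does not change how Googlebot indexes the site for Search, only whether the content is used for AI training.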