The NYT has also found its content in other AI training datasets and is taking action to stop this. Other content creators, including Amazon, Vimeo, and The New Yorker, are also blocking Common Crawl's web crawler, CCBot. It remains unclear whether the NYT has succeeded in getting its content removed from other AI training datasets such as WebText.
Key takeaways:
- The New York Times discovered that its copyrighted content was being used in AI training datasets, such as Common Crawl and WebText, without permission.
- The media company asked Common Crawl to remove its content; the foundation complied and agreed not to scrape any more NYT content in the future.
- Common Crawl, one of the largest AI training datasets, was built by crawling large portions of the public web and serves as training data for many large language models, including OpenAI's GPT-3.
- Other content creators are also taking action against Common Crawl, with almost 14% of the 1,000 most popular websites blocking its crawler, CCBot, typically via a robots.txt rule (a minimal check is sketched below).
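Sites opt out of Common Crawl by disallowing the `CCBot` user agent in their robots.txt file, a voluntary convention that CCBot honors. Below is a minimal sketch, using only Python's standard library, of how one might check whether a given site blocks CCBot; the example.com URL is a placeholder, not a site referenced in this piece.

```python
import urllib.robotparser

# A site that blocks Common Crawl typically serves a robots.txt containing:
#   User-agent: CCBot
#   Disallow: /

# Placeholder URL for illustration; substitute any site's robots.txt.
ROBOTS_URL = "https://www.example.com/robots.txt"

def ccbot_allowed(robots_url: str, page: str = "/") -> bool:
    """Return True if the site's robots.txt permits CCBot to fetch `page`."""
    parser = urllib.robotparser.RobotFileParser()
    parser.set_url(robots_url)
    parser.read()  # fetch and parse the live robots.txt
    return parser.can_fetch("CCBot", page)

if __name__ == "__main__":
    print("CCBot allowed:", ccbot_allowed(ROBOTS_URL))
```

Note that this reflects only the robots.txt convention: it tells you what a compliant crawler like CCBot will honor, not what a site can technically enforce.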