The NYT has also found its content in other AI training datasets and is taking action to stop this. Other content creators, including Amazon, Vimeo, and The New Yorker, are also blocking Common Crawl's web crawler, CCBot. It remains unclear whether the NYT has succeeded in getting its content removed from other AI training datasets such as WebText.
Key takeaways:
- The New York Times discovered that its copyrighted content was being used in AI training datasets, such as Common Crawl and WebText, without permission.
- The media company asked Common Crawl to remove its content; the foundation complied and agreed not to scrape any more NYT content in the future.
- Common Crawl, one of the largest AI training datasets, was built by crawling large portions of the public web and serves as training data for many large language models, including OpenAI's GPT-3.
- Other content creators are also taking action against Common Crawl, with almost 14% of the 1,000 most popular websites blocking its crawler, CCBot, typically via a robots.txt rule (a minimal check is sketched below).
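Sites opt out of Common Crawl by disallowing the `CCBot` user agent in their robots.txt file, a voluntary convention that CCBot honors. Below is a minimal sketch, using only Python's standard library, of how one might check whether a given site blocks CCBot; the example.com URL is a placeholder, not a site referenced in this piece.

```python
import urllib.robotparser

# A site that blocks Common Crawl typically serves a robots.txt containing:
#   User-agent: CCBot
#   Disallow: /

# Placeholder URL for illustration; substitute any site's robots.txt.
ROBOTS_URL = "https://www.example.com/robots.txt"

def ccbot_allowed(robots_url: str, page: str = "/") -> bool:
    """Return True if the site's robots.txt permits CCBot to fetch `page`."""
    parser = urllib.robotparser.RobotFileParser()
    parser.set_url(robots_url)
    parser.read()  # fetch and parse the live robots.txt
    return parser.can_fetch("CCBot", page)

if __name__ == "__main__":
    print("CCBot allowed:", ccbot_allowed(ROBOTS_URL))
```

Note that this reflects only the robots.txt convention: it tells you what a compliant crawler like CCBot will honor, not what a site can technically enforce.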