Data for A.I. Training Is Disappearing Fast, Study Shows

New research from the Data Provenance Initiative reveals a significant decrease in the content available for the collections used to build artificial intelligence (AI). The study found that many key web sources used for training AI models have restricted the use of their data, leading to an "emerging crisis in consent". Across three commonly used AI training data sets, the researchers estimate that 5% of all data, and 25% of data from the highest-quality sources, has been restricted, primarily through the Robots Exclusion Protocol.
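For readers unfamiliar with the Robots Exclusion Protocol, it works through a plain-text robots.txt file in which a site lists the crawlers it does not want and the paths they should stay out of. The sketch below uses Python's standard urllib.robotparser module to show how such a file might turn away one crawler while admitting others. The crawler name ExampleAIBot, the example.com URLs, and the file contents are hypothetical and are not taken from the study.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks one named crawler from the whole site
# and keeps every other crawler out of /private/ only.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    article = "https://example.com/articles/1"
    print(is_allowed(ROBOTS_TXT, "ExampleAIBot", article))
    print(is_allowed(ROBOTS_TXT, "OtherBot", article))
```

Compliance with robots.txt is voluntary on the crawler's side, so a check like this only reports what a site has requested.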

The study's lead author, Shayne Longpre, warns that this rapid decline in consent to use data across the web will have implications not just for AI companies, but also for researchers, academics, and noncommercial entities. The study also found that up to 45% of the data in one set, C4, had been restricted by websites’ terms of service.

Key takeaways

New research from the Data Provenance Initiative has found a significant decrease in content available for the collections used to build artificial intelligence.
Many of the most important web sources used for training A.I. models have restricted the use of their data, leading to an "emerging crisis in consent."
In three commonly used A.I. training data sets, 5 percent of all data, and 25 percent of data from the highest-quality sources, has been restricted.
As much as 45 percent of the data in one set, C4, had been restricted by websites’ terms of service, indicating a rapid decline in consent to use data across the web.
