The study's lead author, Shayne Longpre, warns that this rapid decline in consent to use data across the web will have implications not just for A.I. companies, but also for researchers, academics, and noncommercial entities. The study also found that up to 45 percent of the data in one set, C4, had been restricted by websites’ terms of service.
Key takeaways:
- New research from the Data Provenance Initiative has found a significant decrease in content available for the collections used to build artificial intelligence.
- Many of the most important web sources used for training A.I. models have restricted the use of their data, leading to an 'emerging crisis in consent.'
- In three commonly used A.I. training data sets, 5 percent of all data, and 25 percent of data from the highest-quality sources, have been restricted.
- As much as 45 percent of the data in one set, C4, had been restricted by websites’ terms of service, indicating a rapid decline in consent to use data across the web.