
This is where the data to build AI comes from

Dec 19, 2024 - technologyreview.com
The article examines concerns about data practices in AI development. AI models depend on vast amounts of data, yet the sources and quality of that data are often unclear. The Data Provenance Initiative, a group of more than 50 researchers, audited nearly 4,000 public data sets and found that current AI data practices risk concentrating power in the hands of a few dominant tech companies. Since 2018, the web has become the primary source for AI data sets, and a gap has opened between scraped and curated data. This shift heightens concerns about concentration of power, particularly around platforms like YouTube, which supply a significant portion of the data for multimodal models. The article also discusses hidden restrictions on data usage and the exclusive data-sharing deals that benefit large AI companies, creating asymmetric access to data.

Additionally, the article addresses the Western bias in AI training data: over 90% of data sets come from Europe and North America. This bias stems partly from the dominance of English on the internet and the convenience of using Western data. Without representation from other cultures and languages, AI models can reinforce a US-centric worldview and potentially erase those cultures and languages. The article emphasizes the need for more diverse and intentional data collection so that AI models accurately reflect the global human experience.

Key takeaways:

  • AI data collection practices are immature compared to AI model development, with a lack of transparency about data sources and potential concentration of power in a few tech companies.
  • The dominance of web-scraped data since 2018 has led to a gap between curated and indiscriminately collected data, with YouTube becoming a major source for video data.
  • Exclusive data-sharing deals by major AI companies create asymmetric access, benefiting large corporations while disadvantaging smaller entities and researchers.
  • AI training data is heavily skewed towards Western cultures, potentially reinforcing biases and erasing other languages and cultures from AI models.