AI training data has a price tag that only Big Tech can afford | TechCrunch

Jun 01, 2024 - techcrunch.com
The article discusses the growing importance and cost of the data used to train advanced AI systems, a cost that is putting such data out of reach for all but the wealthiest tech companies. It argues that training data, rather than a model's design or architecture, is the key to sophisticated AI systems. The resulting emphasis on large, high-quality training datasets is concentrating AI development among a few players with billion-dollar budgets. The article also points to ethically questionable practices some companies use to acquire massive datasets, and to the exploitation of workers in third-world countries for data annotation.

The cost of acquiring data is rising, with the market for AI training data expected to grow from roughly $2.5 billion now to close to $30 billion within a decade. That has set off a rush among data brokers and platforms to charge top dollar, often over the objections of their user bases. The article warns that this trend is harming the wider AI research community, since smaller players cannot afford these data licenses. It also mentions a few independent, not-for-profit efforts to create massive datasets for public use, although they face challenges in keeping pace with Big Tech.

Key takeaways:

  • Training data is considered the key to increasingly sophisticated AI systems: the more examples a model has to learn from, the better it performs.
  • However, the growing emphasis on large, high-quality training datasets is concentrating AI development among the few players with billion-dollar budgets that can afford to acquire them.
  • There are concerns about unethical behavior in acquiring training data, including secretly aggregating copyrighted content and relying on low-paid workers in third-world countries to create annotations for training sets.
  • Independent, not-for-profit efforts are underway to create massive datasets that anyone can use to train a generative AI model, but it is uncertain whether these efforts can keep pace with Big Tech.