The cost of acquiring data is rising, with the AI training data market expected to grow from roughly $2.5 billion now to close to $30 billion within a decade. This is prompting data brokers and platforms to rush to charge top dollar, often over the objections of their user bases. The article warns that this trend is harming the wider AI research community, as smaller players won't be able to afford these data licenses. However, it also mentions a few independent, not-for-profit efforts to create massive datasets for public use, although they face challenges in keeping pace with Big Tech.
Key takeaways:
- Training data is considered the key to increasingly sophisticated AI systems: the more examples a model has to learn from, the better it tends to perform.
- However, the growing emphasis on large, high-quality training datasets is concentrating AI development among the few players with billion-dollar budgets that can afford to acquire them.
- There are concerns about unethical behavior in acquiring training data, including secretly aggregating copyrighted content and relying on low-paid workers in developing countries to annotate training sets.
- Independent, not-for-profit efforts are under way to create massive datasets that anyone can use to train a generative AI model, but it is uncertain whether these efforts can keep pace with Big Tech.