The dataset's release method is still under discussion; Harvard is seeking to collaborate with Google on public distribution. The initiative is part of a broader trend of building public-domain datasets to avoid copyright issues, and similar projects are emerging globally: French AI startup Pleias has released its Common Corpus dataset, and AI startup Spawning has launched a public-domain image dataset. These efforts challenge the notion that copyrighted material is necessary for building AI models, though questions remain about whether such datasets will significantly alter current AI training practices.
Key takeaways:
- Harvard University is releasing a high-quality dataset of nearly one million public-domain books to support AI model training, funded by Microsoft and OpenAI.
- The dataset aims to provide equitable access to curated content for AI development, similar to how Linux serves as a foundational operating system.
- Legal uncertainty persists over the use of copyrighted data for AI training, and public-domain datasets like Harvard's are being developed to sidestep it.
- Other initiatives, such as the French Common Corpus and Spawning's Source.Plus, are also creating public-domain datasets to support ethical AI model training.