Harvard Is Releasing a Massive Free AI Training Dataset Funded by OpenAI and Microsoft

Harvard University has announced the release of a high-quality dataset of nearly one million public-domain books, created by its Institutional Data Initiative with funding from Microsoft and OpenAI. The dataset, significantly larger than the Books3 dataset used to train AI models like Meta’s Llama, spans a diverse range of genres, languages, and time periods, and includes works by authors such as Shakespeare and Dickens. The initiative aims to give the general public and smaller AI industry players access to the kind of curated content repositories typically available only to large tech companies. Microsoft supports the project as part of its commitment to creating accessible data pools for AI startups, though it does not plan to replace its existing AI training data with public-domain alternatives.

How the dataset will be released is still under discussion, with Harvard seeking to collaborate with Google on public distribution. The initiative is part of a broader trend of building public-domain datasets to avoid copyright issues, with similar projects emerging globally. French AI startup Pleias has released its Common Corpus dataset, and AI startup Spawning has launched a public-domain image dataset. These efforts challenge the notion that copyrighted materials are necessary for building AI models, though it remains unclear whether such datasets will significantly alter current AI training practices.

Key takeaways

Harvard University is releasing a high-quality dataset of nearly one million public-domain books to support AI model training, funded by Microsoft and OpenAI.
The dataset aims to provide equitable access to curated content for AI development, similar to how Linux serves as a foundational operating system.
Legal uncertainty persists over the use of copyrighted data for AI training, and public-domain datasets like Harvard's are being developed to sidestep it.
Other initiatives, such as the French Common Corpus and Spawning's Source.Plus, are also creating public-domain datasets to support ethical AI model training.
