This project is part of Harvard's Institutional Data Initiative (IDI), which was formally launched with financial support from Microsoft and OpenAI. The IDI aims to provide a "trusted conduit for legal data for AI," and the dataset is designed to democratize access to valuable resources for AI development. Greg Leppert, the IDI’s executive director, says the project seeks to "level the playing field" by making the dataset available to entities beyond deep-pocketed tech firms.
Key takeaways:
- Harvard University plans to release a dataset of around 1 million public-domain books, including works by authors like Dickens, Dante, and Shakespeare.
- The dataset is derived from Google Books, and Google will be involved in its release, although the exact release details are not yet clear.
- The Institutional Data Initiative (IDI) was formally launched with financial backing from Microsoft and OpenAI, aiming to provide legal data for AI.
- The dataset is intended to "level the playing field" by giving a wider range of entities, including research labs and AI startups, access to it for training large language models.