Harvard and Google to release 1 million public-domain books as AI training dataset

Harvard University plans to release a dataset of approximately 1 million public-domain books, including works by authors such as Dickens, Dante, and Shakespeare. The books, which are no longer under copyright protection, were scanned as part of Google's book-scanning project, Google Books. The release is intended to make the dataset available to a wide range of users, including research labs and AI startups, for training large language models. The release date and method have not yet been specified, but Google will help distribute the dataset.

The dataset is part of Harvard's Institutional Data Initiative (IDI), which was formally launched with financial support from Microsoft and OpenAI. The IDI aims to provide a "trusted conduit for legal data for AI," and the dataset is meant to widen access to training data for AI development. Greg Leppert, the IDI’s executive director, says the project seeks to "level the playing field" by making the dataset available to organizations beyond deep-pocketed tech firms.

Key takeaways

Harvard University plans to release a dataset of around 1 million public-domain books, including works by authors like Dickens, Dante, and Shakespeare.
The dataset is derived from Google Books and will involve Google in its release, although the exact release details are not yet clear.
The Institutional Data Initiative (IDI) was formally launched with financial backing from Microsoft and OpenAI, aiming to provide legal data for AI.
The dataset is intended to "level the playing field" by being accessible to organizations beyond large tech firms, including research labs and AI startups, for training large language models.

Harvard and Google to release 1 million public-domain books as AI training dataset | TechCrunch
