The two datasets, described as "internet-based books corpora," made up 16% of GPT-3's training data and contained roughly 67 billion tokens, equivalent to about 50 billion words. OpenAI stopped using the datasets for model training in late 2021 and deleted them in mid-2022 because they were no longer in use. The two researchers who created the datasets are no longer with OpenAI, and the company is seeking to keep their identities, along with details about the datasets, confidential. The Authors Guild opposes this, arguing that the public has a right to know.
Key takeaways:
- OpenAI deleted two large datasets, "books1" and "books2," which were used to train its GPT-3 AI model, amid a class-action lawsuit by the Authors Guild alleging the use of copyrighted materials.
- The datasets, which may have contained more than 100,000 published books, were described as "internet-based books corpora" and made up 16% of the training data for GPT-3.
- The two researchers who created these datasets are no longer employed by OpenAI, and the company has resisted disclosing their identities publicly, leading to ongoing legal disputes.
- Despite deleting these datasets, OpenAI maintains that the models powering ChatGPT and its API today were not developed using them, and that all other data used to train GPT-3 remains intact.