EleutherAI releases massive AI training dataset of licensed and open domain text

EleutherAI has released The Common Pile v0.1, a large dataset of licensed and open-domain text for training AI models, developed in collaboration with AI startups and academic institutions. The 8-terabyte dataset was used to train two new AI models, Comma v0.1-1T and Comma v0.1-2T, which reportedly perform on par with models trained on unlicensed, copyrighted data. EleutherAI says the dataset was curated with legal consultation and draws on public domain sources, with the aim of demonstrating that high-quality models can be built without relying on unlicensed copyrighted material.

The release of The Common Pile v0.1 comes amid ongoing legal disputes involving AI companies like OpenAI over their use of copyrighted material for training datasets. EleutherAI argues that these lawsuits have reduced transparency in AI research, hindering the field's progress. The organization plans to release more open datasets in the future, addressing past criticisms related to The Pile, which included copyrighted content. EleutherAI's efforts highlight the potential for openly licensed data to support competitive AI model development.

Key takeaways

EleutherAI released The Common Pile v0.1, a large dataset for AI model training, created with licensed and open-domain text.
The dataset was used to train two AI models, Comma v0.1-1T and Comma v0.1-2T, which perform comparably to models trained on copyrighted data.
EleutherAI argues that copyright lawsuits have decreased transparency in AI research, making it harder for the field to understand how models work and where they fail.
The Common Pile v0.1 is part of EleutherAI's effort to release open datasets more frequently and to address criticism of its earlier dataset, The Pile, which included copyrighted material.

EleutherAI releases massive AI training dataset of licensed and open domain text | TechCrunch
