AI2 drops biggest open dataset yet for training language models

The Allen Institute for AI (AI2) is working to reverse the trend of closely guarded AI training data with Dolma, a large, free-to-use text dataset. Dolma, short for "Data to feed OLMo's Appetite," is intended to be the foundation for AI2's planned open language model, OLMo. AI2 argues that just as the model will be free for the AI research community to use and modify, the dataset used to create it should be freely available as well.

Dolma is designed to be transparent, with all sources and processes publicly documented. At 3 trillion tokens, it is the largest open dataset to date, and AI2 claims it is also the most straightforward in terms of use and permissions. It is released under the "ImpACT license for medium-risk artifacts," which requires users to provide contact information and intended use cases, disclose any Dolma-derivative creations, distribute those derivatives under the same license, and agree not to apply Dolma to various prohibited areas. Dolma can be accessed via Hugging Face.

Key takeaways

The Allen Institute for AI (AI2) is introducing Dolma, a large, open text dataset for AI research. It is intended to be the basis for AI2's planned open language model, OLMo.
Dolma is designed to be transparent, with all its sources and processes publicly documented. It is the largest open dataset so far, with 3 trillion tokens.
Users of Dolma are required to provide contact information and intended use cases, disclose any Dolma-derivative creations, distribute those derivatives under the same license, and agree not to apply Dolma to various prohibited areas.
AI2 provides a removal request form for people concerned that their personal data may have made it into the dataset. Dolma can be accessed via Hugging Face.

AI2 drops biggest open dataset yet for training language models | TechCrunch
