Dolma is designed to be transparent, with all sources and processes publicly documented. It is the largest open dataset to date, with 3 trillion tokens, and its creators claim it is also the most straightforward in terms of use and permissions. It uses the "ImpACT license for medium-risk artifacts," which requires users to provide contact information and intended use cases, disclose any Dolma-derivative creations, distribute those derivatives under the same license, and agree not to apply Dolma to various prohibited areas. Access to Dolma is available via Hugging Face.
Key takeaways:
- The Allen Institute for AI (AI2) is introducing Dolma, a large, open text dataset for AI research. It is intended to be the basis for their planned open language model, OLMo.
- Dolma is designed to be transparent, with all its sources and processes publicly documented. It is the largest open dataset so far, with 3 trillion tokens.
- Users of Dolma are required to provide contact information and intended use cases, disclose any Dolma-derivative creations, distribute those derivatives under the same license, and agree not to apply Dolma to various prohibited areas.
- AI2 has a removal request form for those who worry that their personal data may have made it into the dataset. Access to Dolma is available via Hugging Face.