Teuken-7B-Base & Teuken-7B-Instruct: Towards European LLMs

Nov 28, 2024 - arxiv.org
The authors present two multilingual large language models (LLMs) designed to support all 24 official languages of the European Union, addressing the limitations of existing LLMs that predominantly focus on English or a few high-resource languages. The models were trained on a dataset comprising around 60% non-English data and use a custom multilingual tokenizer. The paper details the models' development principles, including data composition, tokenizer optimization, and training methodologies.

The models demonstrate competitive performance across multilingual benchmarks, as evidenced by results on European versions of ARC, HellaSwag, MMLU, and TruthfulQA. The authors argue that the models embrace Europe's linguistic diversity and represent a significant step towards more inclusive LLMs.

Key takeaways:

  • The authors have developed two multilingual large language models (LLMs) that support all 24 official languages of the European Union, addressing the limitations of existing LLMs that predominantly focus on English or a few high-resource languages.
  • The models were trained on a dataset comprising around 60% non-English data and utilized a custom multilingual tokenizer.
  • The development principles of the models include data composition, tokenizer optimization, and training methodologies.
  • The models have shown competitive performance across multilingual benchmarks, as demonstrated by their performance on European versions of ARC, HellaSwag, MMLU, and TruthfulQA.
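One motivation for the custom multilingual tokenizer mentioned above is "fertility": the average number of tokens a tokenizer produces per word, which tends to be much higher for under-represented languages under an English-centric vocabulary. The following is a minimal, self-contained sketch of how fertility is measured; the toy tokenizers here are illustrative stand-ins invented for this example, not the actual Teuken tokenizer.

```python
def fertility(tokenize, text: str) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    words = text.split()
    tokens = [t for w in words for t in tokenize(w)]
    return len(tokens) / len(words)

# Toy English-centric tokenizer (hypothetical): keeps ASCII words whole,
# but falls back to character-level pieces for words with non-ASCII
# letters, mimicking poor vocabulary coverage of other languages.
def english_centric(word: str):
    if word.isascii():
        return [word]
    return list(word)

# Toy "multilingual" tokenizer (hypothetical): whole-word coverage.
def multilingual(word: str):
    return [word]

german = "Straßenbahn fährt über die Brücke"
print(fertility(english_centric, german))  # high fertility: many pieces per word
print(fertility(multilingual, german))     # fertility of 1.0
```

A lower fertility on non-English text means shorter token sequences, which translates directly into cheaper training and inference for those languages.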