Teuken-7B-Base & Teuken-7B-Instruct: Towards European LLMs

Nov 28, 2024 - arxiv.org
The authors present two multilingual large language models (LLMs) designed to support all 24 official languages of the European Union, addressing the limitations of existing LLMs that predominantly focus on English or a few high-resource languages. The models were trained on a dataset comprising around 60% non-English data and use a custom multilingual tokenizer. The paper details the models' development principles, including data composition, tokenizer optimization, and training methodologies.

The models demonstrate competitive performance across multilingual benchmarks, as evidenced by results on European versions of ARC, HellaSwag, MMLU, and TruthfulQA. The authors argue that the models embrace Europe's linguistic diversity and represent a significant step towards more inclusive LLMs.

Key takeaways:

  • The authors have developed two multilingual large language models (LLMs) that support all 24 official languages of the European Union, addressing the limitations of existing LLMs that predominantly focus on English or a few high-resource languages.
  • The models were trained on a dataset comprising around 60% non-English data and utilized a custom multilingual tokenizer.
  • The development principles of the models include data composition, tokenizer optimization, and training methodologies.
  • The models have shown competitive performance across multilingual benchmarks, as demonstrated by their performance on European versions of ARC, HellaSwag, MMLU, and TruthfulQA.
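One motivation for the custom multilingual tokenizer mentioned above is "fertility": the average number of tokens a tokenizer produces per word, which tends to be much higher for under-represented languages under an English-centric vocabulary. The following is a minimal, self-contained sketch of how fertility is measured; the toy tokenizers here are illustrative stand-ins invented for this example, not the actual Teuken tokenizer.

```python
def fertility(tokenize, text: str) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    words = text.split()
    tokens = [t for w in words for t in tokenize(w)]
    return len(tokens) / len(words)

# Toy English-centric tokenizer (hypothetical): keeps ASCII words whole,
# but falls back to character-level pieces for words with non-ASCII
# letters, mimicking poor vocabulary coverage of other languages.
def english_centric(word: str):
    if word.isascii():
        return [word]
    return list(word)

# Toy "multilingual" tokenizer (hypothetical): whole-word coverage.
def multilingual(word: str):
    return [word]

german = "Straßenbahn fährt über die Brücke"
print(fertility(english_centric, german))  # high fertility: many pieces per word
print(fertility(multilingual, german))     # fertility of 1.0
```

A lower fertility on non-English text means shorter token sequences, which translates directly into cheaper training and inference for those languages.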