EuroLLM: Multilingual Language Models for Europe

Abstract

The quality of open-weight LLMs has improved significantly, yet they remain predominantly focused on English. In this paper, we introduce the EuroLLM project, which aims to develop a suite of open-weight multilingual LLMs capable of understanding and generating text in all official European Union languages, as well as several additional relevant languages. We outline the progress made to date, detailing our data collection and filtering process, the development of scaling laws, the creation of our multilingual tokenizer, and the data mix and modeling configurations. Additionally, we release our initial models, EuroLLM-1.7B and EuroLLM-1.7B-Instruct, and report their performance on multilingual general benchmarks and machine translation.