EuroLLM: Multilingual Language Models for Europe
September 24, 2024
Authors: Pedro Henrique Martins, Patrick Fernandes, João Alves, Nuno M. Guerreiro, Ricardo Rei, Duarte M. Alves, José Pombal, Amin Farajian, Manuel Faysse, Mateusz Klimaszewski, Pierre Colombo, Barry Haddow, José G. C. de Souza, Alexandra Birch, André F. T. Martins
cs.AI
Abstract
The quality of open-weight LLMs has seen significant improvement, yet they
remain predominantly focused on English. In this paper, we introduce the
EuroLLM project, aimed at developing a suite of open-weight multilingual LLMs
capable of understanding and generating text in all official European Union
languages, as well as several additional relevant languages. We outline the
progress made to date, detailing our data collection and filtering process, the
development of scaling laws, the creation of our multilingual tokenizer, and
the data mix and modeling configurations. Additionally, we release our initial
models, EuroLLM-1.7B and EuroLLM-1.7B-Instruct, and report their performance on
multilingual general benchmarks and machine translation.