Massively Multilingual Adaptation of Large Language Models Using Bilingual Translation Data

May 31, 2025
作者: Shaoxiong Ji, Zihao Li, Jaakko Paavola, Indraneil Paul, Hengyu Luo, Jörg Tiedemann
cs.AI

Abstract

This paper investigates a critical design decision in the practice of massively multilingual continual pre-training -- the inclusion of parallel data. Specifically, we study the impact of bilingual translation data for massively multilingual language adaptation of the Llama 3 family of models to 500 languages. To this end, we construct the MaLA bilingual translation corpus, containing data from more than 2,500 language pairs. Subsequently, we develop the EMMA-500 Llama 3 suite of four massively multilingual models -- continually pre-trained from the Llama 3 family of base models extensively on diverse data mixes up to 671B tokens -- and explore the effect of continual pre-training with or without bilingual translation data. Comprehensive evaluation across 7 tasks and 12 benchmarks demonstrates that bilingual data tends to enhance language transfer and performance, particularly for low-resource languages. We open-source the MaLA corpus, EMMA-500 Llama 3 suite artefacts, code, and model generations.
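The abstract says the data mixes include bilingual translation data, but does not spell out how translation pairs are serialized into plain-text sequences for causal continual pre-training. Below is a minimal sketch of one common approach, language-tagged concatenation of source and target sentences; the `TranslationPair` class, tag format, and `serialize_pairs` helper are illustrative assumptions, not the paper's actual pipeline.

```python
# Illustrative sketch only: the abstract does not specify how the MaLA
# bilingual data is formatted for continual pre-training. A common choice
# is to concatenate source and target sentences with language tags,
# yielding single strings suitable for causal language-model training.

from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class TranslationPair:
    src_lang: str   # e.g. "eng_Latn"
    tgt_lang: str   # e.g. "fin_Latn"
    src_text: str
    tgt_text: str


def serialize_pairs(pairs: Iterable[TranslationPair]) -> Iterator[str]:
    """Turn bilingual pairs into single training strings (hypothetical format)."""
    for p in pairs:
        yield f"<{p.src_lang}> {p.src_text} <{p.tgt_lang}> {p.tgt_text}"


if __name__ == "__main__":
    example = TranslationPair("eng_Latn", "fin_Latn", "Good morning.", "Hyvää huomenta.")
    for line in serialize_pairs([example]):
        print(line)
```

Tag-and-concatenate formats like this let translation pairs be mixed directly with monolingual text under a single next-token objective; the released MaLA corpus and EMMA-500 artefacts should be consulted for the format actually used.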
