Massively Multilingual Adaptation of Large Language Models Using Bilingual Translation Data

Abstract

This paper investigates a critical design decision in massively multilingual continual pre-training: whether to include parallel data. Specifically, we study the impact of bilingual translation data on the massively multilingual adaptation of the Llama 3 family of models to 500 languages. To this end, we construct the MaLA bilingual translation corpus, which contains data from more than 2,500 language pairs. We then develop the EMMA-500 Llama 3 suite of four massively multilingual models, continually pre-trained from the Llama 3 family of base models on diverse data mixes of up to 671B tokens, and explore the effect of continual pre-training with and without bilingual translation data. Comprehensive evaluation across 7 tasks and 12 benchmarks shows that bilingual data tends to enhance language transfer and performance, particularly for low-resource languages. We open-source the MaLA corpus, the EMMA-500 Llama 3 suite artefacts, code, and model generations.
