

LLaMAX: Scaling Linguistic Horizons of LLM by Enhancing Translation Capabilities Beyond 100 Languages

July 8, 2024
Authors: Yinquan Lu, Wenhao Zhu, Lei Li, Yu Qiao, Fei Yuan
cs.AI

Abstract

Large Language Models (LLMs) demonstrate remarkable translation capabilities in high-resource language tasks, yet their performance in low-resource languages is hindered by insufficient multilingual data during pre-training. To address this, we dedicate 35,000 A100-SXM4-80GB GPU hours to extensive multilingual continual pre-training on the LLaMA series models, enabling translation support across more than 100 languages. Through a comprehensive analysis of training strategies, such as vocabulary expansion and data augmentation, we develop LLaMAX. Remarkably, without sacrificing its generalization ability, LLaMAX achieves significantly higher translation performance than existing open-source LLMs (by more than 10 spBLEU points) and performs on par with the specialized translation model M2M-100-12B on the Flores-101 benchmark. Extensive experiments indicate that LLaMAX can serve as a robust multilingual foundation model. The code (https://github.com/CONE-MT/LLaMAX/) and models (https://huggingface.co/LLaMAX/) are publicly available.
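As a rough illustration of how the released checkpoints might be used for translation, below is a minimal sketch using the Hugging Face transformers library. The model identifier and the prompt template are assumptions, not taken from the paper; the exact usage should be checked against the model cards at https://huggingface.co/LLaMAX/.

```python
# Minimal sketch: prompting a LLaMAX checkpoint for translation via transformers.
# The model id and prompt format below are assumptions -- consult the LLaMAX
# model cards on Hugging Face for the officially supported usage.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LLaMAX/LLaMAX2-7B"  # hypothetical id; verify on the LLaMAX hub page
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Simple instruction-style prompt; the actual template used by the authors may differ.
prompt = (
    "Translate the following sentence from English to Swahili:\n"
    "Hello, how are you?\n"
    "Translation:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)

# Print only the newly generated tokens (the translation), not the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```

Translation quality in the paper is reported in spBLEU on Flores-101; if reproducing such scores, sacrebleu's SentencePiece-based "flores101" tokenizer is the usual choice for that metric.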
