

LLaMAX: Scaling Linguistic Horizons of LLM by Enhancing Translation Capabilities Beyond 100 Languages

July 8, 2024
Authors: Yinquan Lu, Wenhao Zhu, Lei Li, Yu Qiao, Fei Yuan
cs.AI

Abstract

Large Language Models (LLMs) demonstrate remarkable translation capabilities on high-resource language tasks, yet their performance in low-resource languages is hindered by insufficient multilingual data during pre-training. To address this, we dedicate 35,000 A100-SXM4-80GB GPU hours to extensive multilingual continual pre-training of the LLaMA series models, enabling translation support across more than 100 languages. Through a comprehensive analysis of training strategies such as vocabulary expansion and data augmentation, we develop LLaMAX. Remarkably, without sacrificing its generalization ability, LLaMAX achieves significantly higher translation performance than existing open-source LLMs (by more than 10 spBLEU points) and performs on par with the specialized translation model M2M-100-12B on the Flores-101 benchmark. Extensive experiments indicate that LLaMAX can serve as a robust multilingual foundation model. The code (https://github.com/CONE-MT/LLaMAX/) and models (https://huggingface.co/LLaMAX/) are publicly available.
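Since the abstract notes that the models are released on Hugging Face, a minimal usage sketch follows. It assumes the checkpoints follow the standard LLaMA causal-LM architecture and load through the standard transformers API; the model identifier "LLaMAX/LLaMAX3-8B-Alpaca" and the instruction-style prompt format are assumptions for illustration, so check the model cards at https://huggingface.co/LLaMAX/ for the exact names and prompting conventions.

```python
# Sketch: loading an LLaMAX checkpoint for translation via Hugging Face transformers.
# The model ID and prompt template below are assumptions, not confirmed by the paper.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LLaMAX/LLaMAX3-8B-Alpaca"  # hypothetical checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Instruction-style translation prompt (format is an assumption).
prompt = (
    "Translate the following sentence from English to Swahili.\n"
    "English: The weather is lovely today.\n"
    "Swahili:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)

# Print only the newly generated tokens (the translation).
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```

The same pattern applies to any of the 100+ supported language pairs by changing the source and target language names in the prompt.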
