LLaMAX: Scaling Linguistic Horizons of LLM by Enhancing Translation Capabilities Beyond 100 Languages

Abstract

Large language models (LLMs) demonstrate remarkable translation capabilities in high-resource language tasks, yet their performance in low-resource languages is hindered by insufficient multilingual data during pre-training. To address this, we dedicate 35,000 A100-SXM4-80GB GPU hours to extensive multilingual continual pre-training of the LLaMA series models, enabling translation support across more than 100 languages. Through a comprehensive analysis of training strategies, such as vocabulary expansion and data augmentation, we develop LLaMAX. Remarkably, without sacrificing its generalization ability, LLaMAX achieves significantly higher translation performance than existing open-source LLMs (by more than 10 spBLEU points) and performs on par with a specialized translation model (M2M-100-12B) on the Flores-101 benchmark. Extensive experiments indicate that LLaMAX can serve as a robust multilingual foundation model. The code (https://github.com/CONE-MT/LLaMAX/) and models (https://huggingface.co/LLaMAX/) are publicly available.
