Massively Multilingual Adaptation of Large Language Models Using Bilingual Translation Data

Abstract

This paper investigates a critical design decision in the practice of massively multilingual continual pre-training -- the inclusion of parallel data. Specifically, we study the impact of bilingual translation data for massively multilingual language adaptation of the Llama 3 family of models to 500 languages. To this end, we construct the MaLA bilingual translation corpus, containing data from more than 2,500 language pairs. Subsequently, we develop the EMMA-500 Llama 3 suite of four massively multilingual models -- continually pre-trained from the Llama 3 family of base models extensively on diverse data mixes up to 671B tokens -- and explore the effect of continual pre-training with or without bilingual translation data. Comprehensive evaluation across 7 tasks and 12 benchmarks demonstrates that bilingual data tends to enhance language transfer and performance, particularly for low-resource languages. We open-source the MaLA corpus, EMMA-500 Llama 3 suite artefacts, code, and model generations.
