SambaLingo: Teaching Large Language Models New Languages

Abstract

Despite the widespread availability of LLMs, there remains a substantial gap in their capabilities and availability across diverse languages. One approach to closing this gap is to take an existing pre-trained LLM and continue training it on new languages. While prior work has experimented with language adaptation, many questions about best practices and methodology remain unaddressed. In this paper, we present a comprehensive investigation into adapting LLMs to new languages. Our study covers the key components of this process, including vocabulary extension, direct preference optimization, and the data scarcity problem for human alignment in low-resource languages. We scale these experiments across 9 languages and 2 parameter scales (7B and 70B). We compare our models against Llama 2, Aya-101, XGLM, BLOOM, and existing language experts, outperforming all prior published baselines. Additionally, all evaluation code and checkpoints are made public to facilitate future research.

