SambaLingo: Teaching Large Language Models New Languages

April 8, 2024
作者: Zoltan Csaki, Bo Li, Jonathan Li, Qiantong Xu, Pian Pawakapan, Leon Zhang, Yun Du, Hengyu Zhao, Changran Hu, Urmish Thakker
cs.AI

Abstract

Despite the widespread availability of LLMs, there remains a substantial gap in their capabilities and availability across diverse languages. One approach to address these issues has been to take an existing pre-trained LLM and continue to train it on new languages. While prior works have experimented with language adaptation, many questions around best practices and methodology have not been covered. In this paper, we present a comprehensive investigation into the adaptation of LLMs to new languages. Our study covers the key components in this process, including vocabulary extension, direct preference optimization and the data scarcity problem for human alignment in low-resource languages. We scale these experiments across 9 languages and 2 parameter scales (7B and 70B). We compare our models against Llama 2, Aya-101, XGLM, BLOOM and existing language experts, outperforming all prior published baselines. Additionally, all evaluation code and checkpoints are made public to facilitate future research.
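
The adaptation recipe starts from an existing checkpoint and continues pretraining it on the target language after extending the base vocabulary. As a rough illustration only (not the paper's actual code or configuration), the sketch below shows how a tokenizer can be extended and the embedding matrix resized with the Hugging Face `transformers` API; the base model name and the example tokens are placeholders.

```python
# Illustrative sketch of vocabulary extension before continued pretraining.
# Assumptions: the base checkpoint and the new tokens below are hypothetical
# placeholders, not the tokens or settings used by SambaLingo.
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = "meta-llama/Llama-2-7b-hf"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# Hypothetical new-language tokens mined from a target-language corpus.
new_tokens = ["▁szó", "▁nyelv", "▁tanulás"]
num_added = tokenizer.add_tokens(new_tokens)

# Grow the input (and tied output) embedding matrix to cover the added
# vocabulary; the new rows are initialized and then learned during
# continued pretraining on target-language text.
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))
```

After this step, continued pretraining and preference optimization (e.g. DPO) proceed on the extended model as described in the paper.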
