SambaLingo: Teaching Large Language Models New Languages

Abstract

Despite the widespread availability of LLMs, there remains a substantial gap in their capabilities and availability across diverse languages. One approach to closing this gap is to take an existing pre-trained LLM and continue training it on new languages. While prior work has experimented with language adaptation, many questions about best practices and methodology remain open. In this paper, we present a comprehensive investigation into the adaptation of LLMs to new languages. Our study covers the key components of this process, including vocabulary extension, direct preference optimization, and the data scarcity problem for human alignment in low-resource languages. We scale these experiments across 9 languages and 2 parameter scales (7B and 70B). We compare our models against Llama 2, Aya-101, XGLM, BLOOM, and existing language experts, outperforming all prior published baselines. Additionally, all evaluation code and checkpoints are made public to facilitate future research.
