SambaLingo: Teaching Large Language Models New Languages
April 8, 2024
Authors: Zoltan Csaki, Bo Li, Jonathan Li, Qiantong Xu, Pian Pawakapan, Leon Zhang, Yun Du, Hengyu Zhao, Changran Hu, Urmish Thakker
cs.AI
Abstract
Despite the widespread availability of LLMs, there remains a substantial gap
in their capabilities and availability across diverse languages. One approach
to address these issues has been to take an existing pre-trained LLM and
continue to train it on new languages. While prior works have experimented with
language adaptation, many questions around best practices and methodology have
not been covered. In this paper, we present a comprehensive investigation into
the adaptation of LLMs to new languages. Our study covers the key components in
this process, including vocabulary extension, direct preference optimization
and the data scarcity problem for human alignment in low-resource languages. We
scale these experiments across 9 languages and 2 parameter scales (7B and 70B).
We compare our models against Llama 2, Aya-101, XGLM, BLOOM and existing
language experts, outperforming all prior published baselines. Additionally,
all evaluation code and checkpoints are made public to facilitate future
research.
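The abstract names vocabulary extension as one of the key components of adapting an existing LLM to a new language. The sketch below is a rough illustration of that idea, not the authors' released code: it adds target-language tokens to a pretrained tokenizer and initializes each new embedding from the mean of the sub-word embeddings the base tokenizer would otherwise use. The checkpoint name and the token list are placeholder assumptions.

```python
# Minimal sketch of vocabulary extension for language adaptation.
# Assumptions: a Llama 2 base checkpoint and an illustrative token list;
# the actual SambaLingo tokenizers/checkpoints may differ.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "meta-llama/Llama-2-7b-hf"       # base checkpoint (assumption)
new_tokens = ["örnek", "szó", "पाठ"]     # hypothetical target-language tokens

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Record how the original tokenizer splits each new token before the vocab changes.
old_ids = [tokenizer(t, add_special_tokens=False)["input_ids"] for t in new_tokens]

# Extend the vocabulary and grow the embedding matrix to match.
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))

# Initialize each new token's embedding as the mean of its original sub-word embeddings.
emb = model.get_input_embeddings().weight
with torch.no_grad():
    for token, ids in zip(new_tokens, old_ids):
        new_id = tokenizer.convert_tokens_to_ids(token)
        emb[new_id] = emb[ids].mean(dim=0)
```

After this step, continued pretraining on target-language text would update both the new and existing embeddings; the averaging initialization is simply one common heuristic for giving new tokens a reasonable starting point.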