LLaMA Beyond English: An Empirical Study on Language Capability Transfer
January 2, 2024
Authors: Jun Zhao, Zhihao Zhang, Qi Zhang, Tao Gui, Xuanjing Huang
cs.AI
Abstract
In recent times, substantial advancements have been witnessed in large
language models (LLMs), exemplified by ChatGPT, showcasing remarkable
proficiency across a range of complex tasks. However, many mainstream LLMs
(e.g., LLaMA) are pretrained on an English-dominant corpus, which limits their
performance in other non-English languages. In this paper, we focus on how to
effectively transfer the capabilities of language generation and following
instructions to a non-English language. To answer this question, we conduct an
extensive empirical investigation based on LLaMA, accumulating over 1440 GPU
hours. We analyze the impact of key factors such as vocabulary extension,
further pretraining, and instruction tuning on transfer. To accurately assess
the model's level of knowledge, we employ four widely used standardized testing
benchmarks: C-Eval, MMLU, AGI-Eval, and GAOKAO-Bench. Furthermore, a
comprehensive evaluation of the model's response quality is conducted,
considering aspects such as accuracy, fluency, informativeness, logical
coherence, and harmlessness, based on LLM-Eval, a benchmark consisting of
instruction tasks from 17 diverse categories. Our evaluation results
demonstrate that comparable performance to state-of-the-art transfer models can
be achieved with less than 1% of the pretraining data, both in terms of
knowledge alignment and response quality. Furthermore, the experimental
outcomes across thirteen low-resource languages also exhibit similar
trends. We anticipate that the conclusions revealed by the experiments will aid
the community in developing non-English LLMs.
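As a concrete illustration of the vocabulary-extension step analyzed in the paper, the minimal sketch below shows how a LLaMA tokenizer could be augmented with target-language tokens and the model's embedding matrix resized to match, using the Hugging Face transformers API. The checkpoint path and the token list are illustrative assumptions, not the authors' actual setup.

# Minimal sketch of vocabulary extension, assuming a Hugging Face LLaMA
# checkpoint and a pre-built list of target-language tokens (e.g., learned
# with SentencePiece on a Chinese corpus). Paths and tokens are placeholders.
from transformers import LlamaForCausalLM, LlamaTokenizer

BASE_MODEL = "path/to/llama-base"            # hypothetical local checkpoint
new_tokens = ["你好", "语言", "模型"]          # hypothetical target-language tokens

tokenizer = LlamaTokenizer.from_pretrained(BASE_MODEL)
model = LlamaForCausalLM.from_pretrained(BASE_MODEL)

# Append the new tokens to the tokenizer vocabulary; duplicates are skipped.
num_added = tokenizer.add_tokens(new_tokens)

# Grow the input and output embedding layers so the new token ids have
# trainable rows; these rows are randomly initialized.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; new vocab size: {len(tokenizer)}")

Because the new embedding rows start from random initialization, the extended vocabulary only becomes useful after the other factors the paper studies, further pretraining and instruction tuning on the target language, update those rows.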