Layer Swapping for Zero-Shot Cross-Lingual Transfer in Large Language Models
October 2, 2024
Authors: Lucas Bandarkar, Benjamin Muller, Pritish Yuvraj, Rui Hou, Nayan Singhal, Hongjiang Lv, Bing Liu
cs.AI
Abstract
Model merging, such as model souping, is the practice of combining different
models with the same architecture together without further training. In this
work, we present a model merging methodology that addresses the difficulty of
fine-tuning Large Language Models (LLMs) for target tasks in non-English
languages, where task-specific data is often unavailable. We focus on
mathematical reasoning and, without in-language math data, facilitate
cross-lingual transfer by composing language and math capabilities. Starting
from the same pretrained model, we fine-tune separate "experts" on math
instruction data in English and on generic instruction data in the target
language. We then replace the top and bottom transformer layers of the math
expert directly with layers from the language expert, which consequently
enhances math performance in the target language. The resulting merged models
outperform the individual experts and other merging methods on the math
benchmark, MGSM, by 10% across four major languages where math instruction data
is scarce. In addition, this layer swapping is simple, inexpensive, and
intuitive, as it is based on an interpretative analysis of the most important
parameter changes during the fine-tuning of each expert. The ability to
successfully re-compose LLMs for cross-lingual transfer in this manner opens up
future possibilities to combine model expertise, create modular solutions, and
transfer reasoning capabilities across languages all post hoc.
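To make the layer-swapping recipe concrete, below is a minimal sketch of how such a merge could be performed on two fine-tuned checkpoints. It assumes a LLaMA-style architecture whose decoder blocks are named `model.layers.{i}`; the swap counts `n_bottom` and `n_top`, the `swap_layers` helper, and all paths are illustrative placeholders, not the paper's released implementation.

```python
# Minimal sketch of the layer-swapping idea described in the abstract.
# Assumptions (not taken from the paper's code): both experts were fine-tuned
# from the same base checkpoint, the model follows a LLaMA-style layout with
# decoder blocks named "model.layers.{i}.", and the number of swapped top and
# bottom blocks is illustrative.
import re
import torch
from transformers import AutoModelForCausalLM

def swap_layers(math_expert_path, lang_expert_path, n_bottom=4, n_top=4):
    math_model = AutoModelForCausalLM.from_pretrained(math_expert_path, torch_dtype=torch.bfloat16)
    lang_model = AutoModelForCausalLM.from_pretrained(lang_expert_path, torch_dtype=torch.bfloat16)

    num_layers = math_model.config.num_hidden_layers
    # Indices of the bottom and top transformer blocks to take from the language expert.
    swap_ids = set(range(n_bottom)) | set(range(num_layers - n_top, num_layers))

    merged_state = math_model.state_dict()
    for name, tensor in lang_model.state_dict().items():
        match = re.match(r"model\.layers\.(\d+)\.", name)
        if match and int(match.group(1)) in swap_ids:
            # Overwrite the math expert's weights with the language expert's for swapped blocks.
            merged_state[name] = tensor

    math_model.load_state_dict(merged_state)
    return math_model  # math expert with the language expert's top and bottom layers

# Example usage (paths are placeholders):
# merged = swap_layers("path/to/math-expert", "path/to/lang-expert")
# merged.save_pretrained("path/to/merged-model")
```

Because the swap is a pure parameter copy between models that share an architecture and pretrained initialization, it requires no further training and can be applied post hoc to any pair of such experts.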