FuseChat: Knowledge Fusion of Chat Models

February 25, 2024
Authors: Fanqi Wan, Ziyi Yang, Longguang Zhong, Xiaojun Quan, Xinting Huang, Wei Bi
cs.AI

Abstract

While training large language models (LLMs) from scratch can indeed lead to models with distinct capabilities and strengths, this approach incurs substantial costs and may lead to potential redundancy in competencies. An alternative strategy is to combine existing LLMs into a more robust LLM, thereby diminishing the necessity for expensive pre-training. However, due to the diverse architectures of LLMs, direct parameter blending proves to be unfeasible. Recently, FuseLLM introduced the concept of knowledge fusion to transfer the collective knowledge of multiple structurally varied LLMs into a target LLM through lightweight continual training. In this report, we extend the scalability and flexibility of the FuseLLM framework to realize the fusion of chat LLMs, resulting in FuseChat. FuseChat comprises two main stages. Firstly, we undertake knowledge fusion for structurally and scale-varied source LLMs to derive multiple target LLMs of identical structure and size via lightweight fine-tuning. Then, these target LLMs are merged within the parameter space, wherein we propose a novel method for determining the merging weights based on the variation ratio of parameter matrices before and after fine-tuning. We validate our approach using three prominent chat LLMs with diverse architectures and scales, namely NH2-Mixtral-8x7B, NH2-Solar-10.7B, and OpenChat-3.5-7B. Experimental results spanning various chat domains demonstrate the superiority of FuseChat-7B across a broad spectrum of chat LLMs at 7B and 34B scales, even surpassing GPT-3.5 (March) and approaching Mixtral-8x7B-Instruct. Our code, model weights, and data are openly accessible at https://github.com/fanqiwan/FuseLLM.
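To make the second stage more concrete, the sketch below merges several fine-tuned target models of identical architecture by weighting each parameter matrix according to how much it changed from a shared base model during fine-tuning, i.e., a variation-ratio-style weighting. This is a minimal illustration under stated assumptions: the function name, the squared-difference variation measure, and the state-dict interface are hypothetical choices for exposition, not the paper's reference implementation.

```python
import torch


def variation_ratio_merge(base_state, finetuned_states):
    """Merge fine-tuned models of identical architecture into one.

    base_state: state dict of the shared base (pre-fine-tuning) model.
    finetuned_states: list of state dicts, one per fine-tuned target model.
    Each parameter matrix is combined with weights proportional to how much
    that matrix changed from the base model during fine-tuning (an assumed
    reading of the variation-ratio idea, not the authors' exact formula).
    """
    merged = {}
    for name, base_param in base_state.items():
        # Per-model variation of this matrix relative to the base model.
        variations = [
            torch.sum((ft[name] - base_param) ** 2) for ft in finetuned_states
        ]
        total = sum(variations)
        if total == 0:
            # No model changed this matrix; keep the base parameters.
            merged[name] = base_param.clone()
            continue
        # Normalize the variations into per-matrix merging weights.
        weights = [v / total for v in variations]
        merged[name] = sum(
            w * ft[name] for w, ft in zip(weights, finetuned_states)
        )
    return merged
```

In this reading, matrices that a given target model altered more during fine-tuning contribute more to the merged model for that matrix, so the merging weights are computed per parameter matrix rather than as a single global coefficient per model.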