FuseChat: Knowledge Fusion of Chat Models
August 15, 2024
Authors: Fanqi Wan, Longguang Zhong, Ziyi Yang, Ruijun Chen, Xiaojun Quan
cs.AI
Abstract
While training large language models (LLMs) from scratch can indeed lead to
models with distinct capabilities and strengths, it incurs substantial costs
and may lead to redundancy in competencies. Knowledge fusion aims to integrate
existing LLMs of diverse architectures and capabilities into a more potent LLM
through lightweight continual training, thereby reducing the need for costly
LLM development. In this work, we propose a new framework for the knowledge
fusion of chat LLMs through two main stages, resulting in FuseChat. Firstly, we
conduct pairwise knowledge fusion on source chat LLMs of varying structures and
scales to create multiple target LLMs with identical structure and size via
lightweight fine-tuning. During this process, a statistics-based token
alignment approach is introduced as the cornerstone for fusing LLMs with
different structures. Secondly, we merge these target LLMs within the parameter
space, where we propose a novel method for determining the merging coefficients
based on the magnitude of parameter updates before and after fine-tuning. We
implement and validate FuseChat using six prominent chat LLMs with diverse
architectures and scales, including OpenChat-3.5-7B, Starling-LM-7B-alpha,
NH2-SOLAR-10.7B, InternLM2-Chat-20B, Mixtral-8x7B-Instruct, and
Qwen-1.5-Chat-72B. Experimental results on two instruction-following
benchmarks, AlpacaEval 2.0 and MT-Bench, demonstrate the superiority of
FuseChat-7B over baselines of various sizes. Our model is even comparable to
the larger Mixtral-8x7B-Instruct and approaches GPT-3.5-Turbo-1106 on MT-Bench.
Our code, model weights, and data are publicly available at
https://github.com/fanqiwan/FuseAI.
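The second stage merges the target LLMs, which share one architecture, by weighting each model's contribution according to how much its parameters moved during fine-tuning. Below is a minimal, illustrative sketch of that idea in PyTorch: for every parameter tensor, each target model receives a coefficient proportional to the squared magnitude of its update relative to the shared pre-fine-tuning (pivot) weights. The function name, the squared-norm magnitude measure, and the per-tensor normalization are assumptions for illustration; they are not the paper's exact formulation.

```python
import torch


def merge_by_update_magnitude(pivot_state, target_states, eps=1e-8):
    """Merge fine-tuned target models (same architecture) into one model.

    Sketch of magnitude-weighted merging: per parameter tensor, each target
    model's coefficient is proportional to the size of its update relative
    to the shared pivot weights. Hypothetical simplification of the method
    described in the abstract.
    """
    merged = {}
    for name, pivot_w in pivot_state.items():
        # Parameter updates introduced by fine-tuning each target model.
        updates = [state[name] - pivot_w for state in target_states]
        # Magnitude of each model's update for this tensor (squared L2 norm).
        mags = torch.stack([u.float().pow(2).sum() for u in updates])
        # Normalize magnitudes into merging coefficients that sum to 1.
        coeffs = mags / (mags.sum() + eps)
        delta = sum(c * u for c, u in zip(coeffs, updates))
        merged[name] = pivot_w + delta
    return merged


# Example usage with state dicts loaded from disk (paths are placeholders):
# pivot = torch.load("pivot_model.pt")
# targets = [torch.load(p) for p in ["target_a.pt", "target_b.pt"]]
# merged_state = merge_by_update_magnitude(pivot, targets)
```

Tensors whose fine-tuning barely changed them contribute little to the merged delta, while tensors that absorbed substantial new knowledge dominate, which is the intuition behind basing coefficients on update magnitude.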