FuseChat: Knowledge Fusion of Chat Models
August 15, 2024
Authors: Fanqi Wan, Longguang Zhong, Ziyi Yang, Ruijun Chen, Xiaojun Quan
cs.AI
Abstract
While training large language models (LLMs) from scratch can indeed lead to
models with distinct capabilities and strengths, it incurs substantial costs
and may lead to redundancy in competencies. Knowledge fusion aims to integrate
existing LLMs of diverse architectures and capabilities into a more potent LLM
through lightweight continual training, thereby reducing the need for costly
LLM development. In this work, we propose a new framework for the knowledge
fusion of chat LLMs through two main stages, resulting in FuseChat. Firstly, we
conduct pairwise knowledge fusion on source chat LLMs of varying structures and
scales to create multiple target LLMs with identical structure and size via
lightweight fine-tuning. During this process, a statistics-based token
alignment approach is introduced as the cornerstone for fusing LLMs with
different structures. Secondly, we merge these target LLMs within the parameter
space, where we propose a novel method for determining the merging coefficients
based on the magnitude of parameter updates before and after fine-tuning. We
implement and validate FuseChat using six prominent chat LLMs with diverse
architectures and scales, including OpenChat-3.5-7B, Starling-LM-7B-alpha,
NH2-SOLAR-10.7B, InternLM2-Chat-20B, Mixtral-8x7B-Instruct, and
Qwen-1.5-Chat-72B. Experimental results on two instruction-following
benchmarks, AlpacaEval 2.0 and MT-Bench, demonstrate the superiority of
FuseChat-7B over baselines of various sizes. Our model is even comparable to
the larger Mixtral-8x7B-Instruct and approaches GPT-3.5-Turbo-1106 on MT-Bench.
Our code, model weights, and data are public at
https://github.com/fanqiwan/FuseAI.
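
The first stage hinges on a statistics-based token alignment so that source models with different tokenizers can be fused. The snippet below is a minimal sketch of one way such a mapping could be built, not the paper's exact algorithm: it pairs tokens produced by the source and target tokenizers over a shared corpus and maps each source token to the target token it co-occurs with most often. The helper names (`build_alignment_statistics`, `token_mapping`) and the positional pairing are illustrative assumptions.

```python
from collections import Counter, defaultdict

def build_alignment_statistics(corpus, src_tokenize, tgt_tokenize):
    """Count how often each source-vocabulary token lines up with each
    target-vocabulary token when both tokenizers segment the same texts.
    Positional pairing is a simplification; a faithful implementation
    would align tokens by character spans."""
    stats = defaultdict(Counter)
    for text in corpus:
        for src_tok, tgt_tok in zip(src_tokenize(text), tgt_tokenize(text)):
            stats[src_tok][tgt_tok] += 1
    return stats

def token_mapping(stats):
    """Map each source token to its most frequent target counterpart."""
    return {src: counts.most_common(1)[0][0] for src, counts in stats.items()}
```

With a mapping like this, a source model's output distribution can be projected onto the target model's vocabulary before distillation-style pairwise fusion.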
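
The second stage merges the fine-tuned target models in parameter space, with merging coefficients derived from how far each model's parameters moved during fine-tuning. The sketch below assumes PyTorch state dicts and uses the L1 magnitude of each model's update relative to the shared pivot as the weighting signal; the paper's exact statistic and granularity may differ, and `merge_models` is a hypothetical helper.

```python
import torch

def merge_models(pivot_state, target_states, eps=1e-8):
    """Weighted-average merge of several fine-tuned models that share one
    architecture. A model's weight for a given parameter tensor is
    proportional to how much that tensor changed from the pivot model,
    which is one way to realize magnitude-based merging coefficients."""
    merged = {}
    for name, pivot_param in pivot_state.items():
        updates = [ts[name].float() - pivot_param.float() for ts in target_states]
        magnitudes = torch.stack([u.abs().sum() for u in updates])
        weights = magnitudes / (magnitudes.sum() + eps)  # normalize to sum to 1
        merged[name] = sum(w * ts[name].float() for w, ts in zip(weights, target_states))
    return merged
```

Intuitively, a target model whose parameters changed more during pairwise fusion contributes more to the merged model.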