FuseChat: Knowledge Fusion of Chat Models
February 25, 2024
Authors: Fanqi Wan, Ziyi Yang, Longguang Zhong, Xiaojun Quan, Xinting Huang, Wei Bi
cs.AI
Abstract
While training large language models (LLMs) from scratch can indeed lead to
models with distinct capabilities and strengths, this approach incurs
substantial costs and may lead to potential redundancy in competencies. An
alternative strategy is to combine existing LLMs into a more robust LLM,
thereby diminishing the necessity for expensive pre-training. However, due to
the diverse architectures of LLMs, direct parameter blending proves to be
unfeasible. Recently, FuseLLM introduced the concept of knowledge
fusion to transfer the collective knowledge of multiple structurally varied
LLMs into a target LLM through lightweight continual training. In this report,
we extend the scalability and flexibility of the FuseLLM framework to
realize the fusion of chat LLMs, resulting in FuseChat.
FuseChat comprises two main stages. Firstly, we undertake knowledge
fusion for structurally and scale-varied source LLMs to derive multiple target
LLMs of identical structure and size via lightweight fine-tuning. Then, these
target LLMs are merged within the parameter space, wherein we propose a novel
method for determining the merging weights based on the variation ratio of
parameter matrices before and after fine-tuning. We validate our approach using
three prominent chat LLMs with diverse architectures and scales, namely
NH2-Mixtral-8x7B, NH2-Solar-10.7B, and
OpenChat-3.5-7B. Experimental results spanning various chat domains
demonstrate the superiority of FuseChat-7B across a broad
spectrum of chat LLMs at 7B and 34B scales, even surpassing GPT-3.5
(March) and approaching Mixtral-8x7B-Instruct. Our code, model
weights, and data are openly accessible at
https://github.com/fanqiwan/FuseLLM.
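
The second stage described above, merging the same-architecture target LLMs in parameter space with weights derived from how much each parameter matrix changed during fine-tuning, can be pictured with a minimal sketch. The function name `merge_in_parameter_space`, the per-matrix squared-change ratio, and the `eps` smoothing term are illustrative assumptions rather than the paper's exact formulation; see the repository above for the authors' implementation.

```python
import torch

def merge_in_parameter_space(
    theta_base: dict[str, torch.Tensor],
    theta_targets: list[dict[str, torch.Tensor]],
    eps: float = 1e-12,
) -> dict[str, torch.Tensor]:
    """Merge same-architecture target models in parameter space.

    For every parameter matrix, each target model gets a merging weight
    proportional to how much that matrix changed during fine-tuning
    relative to the shared base checkpoint (its "variation ratio").
    This is a sketch, not the paper's exact weighting scheme.
    """
    merged = {}
    for name, base in theta_base.items():
        base = base.float()
        # Relative squared change introduced by fine-tuning, one value per target.
        ratios = [
            ((t[name].float() - base) ** 2).sum() / ((base ** 2).sum() + eps)
            for t in theta_targets
        ]
        total = sum(ratios) + eps
        # Normalize the ratios into merging weights for this parameter matrix.
        weights = [r / total for r in ratios]
        merged[name] = sum(w * t[name].float() for w, t in zip(weights, theta_targets))
    return merged
```

Here `theta_base` would be the shared starting checkpoint of the target LLMs and `theta_targets` their state dicts after the stage-one lightweight fine-tuning; the merged state dict is then loaded back into the target architecture.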