FuseChat: Knowledge Fusion of Chat Models

Abstract

Training large language models (LLMs) from scratch can yield models with distinct capabilities and strengths, but it incurs substantial costs and may produce redundant competencies. Knowledge fusion aims to integrate existing LLMs of diverse architectures and capabilities into a more potent LLM through lightweight continual training, thereby reducing the need for costly LLM development. In this work, we propose a new two-stage framework for the knowledge fusion of chat LLMs, resulting in FuseChat. First, we conduct pairwise knowledge fusion on source chat LLMs of varying structures and scales to create multiple target LLMs with identical structure and size via lightweight fine-tuning. During this process, a statistics-based token alignment approach is introduced as the cornerstone for fusing LLMs with different structures. Second, we merge these target LLMs within the parameter space, where we propose a novel method for determining the merging coefficients based on the magnitude of parameter updates before and after fine-tuning. We implement and validate FuseChat using six prominent chat LLMs with diverse architectures and scales: OpenChat-3.5-7B, Starling-LM-7B-alpha, NH2-SOLAR-10.7B, InternLM2-Chat-20B, Mixtral-8x7B-Instruct, and Qwen-1.5-Chat-72B. Experimental results on two instruction-following benchmarks, AlpacaEval 2.0 and MT-Bench, demonstrate the superiority of FuseChat-7B over baselines of various sizes. Our model is even comparable to the larger Mixtral-8x7B-Instruct and approaches GPT-3.5-Turbo-1106 on MT-Bench. Our code, model weights, and data are publicly available at https://github.com/fanqiwan/FuseAI.
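The second stage, merging target LLMs in parameter space with coefficients derived from the magnitude of their fine-tuning updates, can be illustrated with a small sketch. The abstract does not give the exact formula, so the rule below is an assumption made for illustration: for each parameter tensor, every target model is weighted by the squared magnitude of its update relative to a shared pre-fine-tuning base, normalized across targets. The function name `merge_by_update_magnitude`, the `Params` layout, and the uniform fallback are hypothetical and are not taken from the FuseChat code.

```python
from typing import Dict, List

# Hypothetical layout: parameter name -> flattened list of weights.
Params = Dict[str, List[float]]


def merge_by_update_magnitude(base: Params, targets: List[Params]) -> Params:
    """Merge fine-tuned targets, weighting each by its squared update size.

    Assumption: all targets share the structure of `base`, the model
    they were fine-tuned from.
    """
    merged: Params = {}
    for name, base_w in base.items():
        # Squared magnitude of each target's update for this tensor.
        mags = [
            sum((t - b) ** 2 for t, b in zip(target[name], base_w))
            for target in targets
        ]
        total = sum(mags)
        if total == 0.0:
            # No target changed this tensor; fall back to a uniform average.
            coeffs = [1.0 / len(targets)] * len(targets)
        else:
            coeffs = [m / total for m in mags]
        # Weighted sum of the target weights, element by element.
        merged[name] = [
            sum(c * target[name][i] for c, target in zip(coeffs, targets))
            for i in range(len(base_w))
        ]
    return merged


# Toy example: target_b moved further from the base, so it dominates.
base = {"w": [0.0, 0.0]}
target_a = {"w": [1.0, 0.0]}
target_b = {"w": [0.0, 3.0]}
print(merge_by_update_magnitude(base, [target_a, target_b]))
```

In this toy case the squared update magnitudes are 1 and 9, so the coefficients come out as 0.1 and 0.9. Computing coefficients per tensor rather than per model is a design choice of the sketch: it lets a target that changed a given tensor more contribute more to that tensor only.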

