Model Merging and Safety Alignment: One Bad Model Spoils the Bunch

Abstract

Merging Large Language Models (LLMs) is a cost-effective technique for combining multiple expert LLMs into a single versatile model that retains the expertise of the originals. However, current approaches often overlook the importance of safety alignment during merging, leading to highly misaligned models. This work investigates the effects of model merging on alignment. We evaluate several popular model merging techniques and demonstrate that existing methods not only transfer domain expertise but also propagate misalignment. To address this problem, we propose a simple two-step approach: (i) generating synthetic safety and domain-specific data, and (ii) incorporating the generated data into the optimization process of existing data-aware model merging techniques. This allows us to treat alignment as a skill that can be maximized in the resulting merged LLM. Our experiments illustrate the effectiveness of integrating alignment-related data during merging, resulting in models that excel in both domain expertise and alignment.
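
The two-step idea can be illustrated with a toy sketch. This is not the authors' implementation: two-dimensional linear models stand in for LLMs, a task-arithmetic style interpolation stands in for the merging technique, and a plain grid search stands in for the data-aware optimization. All names (`merge`, `mse`, `search_coeffs`, the toy weights and datasets) are hypothetical. The first weight represents domain skill and the second represents safety behavior.

```python
import itertools
import numpy as np

def merge(base, experts, coeffs):
    """Task-arithmetic style merge: base + sum_i coeff_i * (expert_i - base)."""
    merged = base.copy()
    for expert, c in zip(experts, coeffs):
        merged += c * (expert - base)
    return merged

def mse(weights, inputs, targets):
    """Mean squared error of a linear model, standing in for a task loss."""
    return float(np.mean((inputs @ weights - targets) ** 2))

def search_coeffs(base, experts, datasets, grid):
    """Grid-search merge coefficients that minimize the summed loss over datasets."""
    best_coeffs, best_loss = None, float("inf")
    for coeffs in itertools.product(grid, repeat=len(experts)):
        merged = merge(base, experts, coeffs)
        loss = sum(mse(merged, x, y) for x, y in datasets)
        if loss < best_loss:  # strict comparison: the first best candidate wins ties
            best_coeffs, best_loss = coeffs, loss
    return best_coeffs, best_loss

# Toy setup (assumed values): weight 0 = domain skill, weight 1 = safety.
base = np.zeros(2)
domain_expert = np.array([1.0, -0.5])   # strong on the domain, misaligned
aligned_expert = np.array([0.0, 1.5])   # no domain skill, strongly aligned
experts = [domain_expert, aligned_expert]

# Step (i): stand-ins for the generated domain and safety data.
domain_data = (np.array([[1.0, 0.0]]), np.array([1.0]))
safety_data = (np.array([[0.0, 1.0]]), np.array([1.0]))

# Step (ii): optimize the merge with and without the safety data.
grid = (0.0, 0.5, 1.0)
domain_only = search_coeffs(base, experts, [domain_data], grid)
with_safety = search_coeffs(base, experts, [domain_data, safety_data], grid)

print("domain data only:", domain_only, merge(base, experts, domain_only[0]))
print("domain + safety: ", with_safety, merge(base, experts, with_safety[0]))
```

In this toy setting, optimizing on domain data alone selects coefficients (1.0, 0.0), so the merged model inherits the domain expert's negative safety weight. Adding the safety data to the objective selects (1.0, 1.0), which recovers both the domain weight and the safety weight.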
