

Model Merging and Safety Alignment: One Bad Model Spoils the Bunch

June 20, 2024
作者: Hasan Abed Al Kader Hammoud, Umberto Michieli, Fabio Pizzati, Philip Torr, Adel Bibi, Bernard Ghanem, Mete Ozay
cs.AI

Abstract

Merging Large Language Models (LLMs) is a cost-effective technique for combining multiple expert LLMs into a single versatile model, retaining the expertise of the original ones. However, current approaches often overlook the importance of safety alignment during merging, leading to highly misaligned models. This work investigates the effects of model merging on alignment. We evaluate several popular model merging techniques, demonstrating that existing methods not only transfer domain expertise but also propagate misalignment. We propose a simple two-step approach to address this problem: (i) generating synthetic safety and domain-specific data, and (ii) incorporating these generated data into the optimization process of existing data-aware model merging techniques. This allows us to treat alignment as a skill that can be maximized in the resulting merged LLM. Our experiments illustrate the effectiveness of integrating alignment-related data during merging, resulting in models that excel in both domain expertise and alignment.
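The two-step approach can be sketched in miniature as follows. This is a toy illustration under stated assumptions, not the authors' implementation: model parameters are stand-in NumPy vectors, the merge is simple linear interpolation, and the `score` function and the synthetic "safety" and "domain" datasets are hypothetical placeholders for real evaluations on generated data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for two expert models' flattened parameter vectors.
base = rng.normal(size=8)
expert_domain = base + rng.normal(scale=0.1, size=8)  # domain expert
expert_safety = base + rng.normal(scale=0.1, size=8)  # safety-aligned expert

def merge(alpha):
    """A simple linear-interpolation merge between the two experts."""
    return alpha * expert_domain + (1 - alpha) * expert_safety

def score(params, data):
    """Hypothetical evaluation proxy: higher is better on the given data."""
    return -np.mean((data @ params) ** 2)

# Step (i): generate synthetic safety and domain-specific evaluation data
# (here just random feature matrices as placeholders).
safety_data = rng.normal(size=(16, 8))
domain_data = rng.normal(size=(16, 8))

# Step (ii): data-aware merging — search for the interpolation weight that
# maximizes the combined domain + alignment score on the generated data,
# treating alignment as a skill to be maximized alongside domain expertise.
alphas = np.linspace(0.0, 1.0, 21)
combined = [score(merge(a), domain_data) + score(merge(a), safety_data)
            for a in alphas]
best_alpha = alphas[int(np.argmax(combined))]
merged_params = merge(best_alpha)
```

In practice the search over merging coefficients would run per layer or per task vector and the scores would come from evaluating the merged LLM on the generated safety and domain prompts, but the structure of the optimization is the same.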

