Model Merging and Safety Alignment: One Bad Model Spoils the Bunch

Abstract

Merging Large Language Models (LLMs) is a cost-effective technique for combining multiple expert LLMs into a single versatile model that retains the expertise of the originals. However, current approaches often overlook the importance of safety alignment during merging, leading to highly misaligned models. This work investigates the effects of model merging on alignment. We evaluate several popular model merging techniques and demonstrate that existing methods not only transfer domain expertise but also propagate misalignment. To address this problem, we propose a simple two-step approach: (i) generating synthetic safety and domain-specific data, and (ii) incorporating the generated data into the optimization process of existing data-aware model merging techniques. This allows us to treat alignment as a skill that can be maximized in the resulting merged LLM. Our experiments show that integrating alignment-related data during merging is effective, yielding models that excel in both domain expertise and alignment.
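
To make the idea of treating alignment as one more skill in a data-aware merge concrete, the following is a minimal toy sketch and not the method evaluated in this work. It assumes tiny linear models in place of LLMs, a task-arithmetic style merge (base weights plus scaled task vectors), and a plain grid search standing in for whatever optimizer a real data-aware merging technique uses. All names (`merge`, `mse`, `search_coeffs`, the toy datasets) are hypothetical and chosen only for illustration.

```python
from itertools import product


def merge(base, task_vectors, coeffs):
    """Task-arithmetic style merge: base + sum_i coeffs[i] * task_vectors[i]."""
    return [
        b + sum(c * tv[j] for c, tv in zip(coeffs, task_vectors))
        for j, b in enumerate(base)
    ]


def mse(weights, data):
    """Mean squared error of the linear model w . x over (x, y) pairs."""
    total = 0.0
    for x, y in data:
        pred = sum(w * xi for w, xi in zip(weights, x))
        total += (pred - y) ** 2
    return total / len(data)


def search_coeffs(base, task_vectors, datasets, grid):
    """Grid-search merge coefficients that minimize the summed loss over datasets.

    Ties are broken in favor of the first candidate visited.
    """
    best_loss, best_coeffs = None, None
    for coeffs in product(grid, repeat=len(task_vectors)):
        merged = merge(base, task_vectors, coeffs)
        loss = sum(mse(merged, d) for d in datasets)
        if best_loss is None or loss < best_loss:
            best_loss, best_coeffs = loss, coeffs
    return best_coeffs, best_loss


# Toy setup (all values are illustrative assumptions):
# weight 0 encodes a "domain" skill, weight 1 encodes a "safety" skill.
base = [0.0, 0.0]
task_vectors = [
    [1.0, -1.0],  # domain expert that also damages the safety weight
    [0.0, 1.0],   # aligned model that only improves the safety weight
]
domain_data = [([1.0, 0.0], 1.0)]  # stands in for synthetic domain data
safety_data = [([0.0, 1.0], 1.0)]  # stands in for synthetic safety data
grid = [0.0, 0.5, 1.0, 1.5, 2.0]

# Optimizing on domain data alone is indifferent to the safety weight.
domain_only, _ = search_coeffs(base, task_vectors, [domain_data], grid)
# Adding safety data to the objective makes alignment part of what is optimized.
with_safety, _ = search_coeffs(
    base, task_vectors, [domain_data, safety_data], grid
)

print("domain only:", domain_only, "safety loss:",
      mse(merge(base, task_vectors, domain_only), safety_data))
print("with safety:", with_safety, "safety loss:",
      mse(merge(base, task_vectors, with_safety), safety_data))
```

In this toy setting, the domain-only objective has no signal about the safety weight, so the misaligned expert's damage carries over into the merged model. Once safety data enters the objective, the search raises the aligned model's coefficient enough to offset that damage while keeping the domain fit.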
