Emergent Misalignment Can Be Induced by Sycophancy and Reversed by Alignment Gating

Abstract

Prior work has shown that fine-tuning large language models on malicious or incorrect outputs in narrow domains can induce broad misalignment and harmful behavior, a phenomenon known as emergent misalignment (EM). However, efficient methods for reversing such misalignment remain limited. In this work, we make two contributions. First, we identify sycophancy fine-tuning, i.e., training models to passively agree with users' incorrect opinions, as a previously underexplored driver of EM, and show that it induces broad and severe misaligned behavior. Second, we propose Alignment Gating, an efficient method for reversing EM that inserts learnable and controllable gates into the model during fine-tuning. Through fine-tuning, these gates learn to identify the internal representations responsible for unsafe responses. Amplifying or suppressing these representations therefore exacerbates or mitigates EM, respectively. We further find that the Alignment Gating module exhibits strong generalization: gate weights obtained from narrow-domain fine-tuning substantially suppress broad-domain misaligned behavior while preserving the model's general capabilities.
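To make the idea of a learnable, controllable gate concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the method described in this paper: the abstract does not specify the gate's functional form or where it is inserted. The class name `AlignmentGate`, the elementwise sigmoid parameterization, and the scalar `control` knob are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


@dataclass
class AlignmentGate:
    """Hypothetical elementwise gate over one hidden vector.

    Assumption: the real Alignment Gating module may differ in form and placement.
    """
    logits: List[float]   # learnable parameters, one per hidden dimension (assumed)
    control: float = 1.0  # assumed knob: positive amplifies, negative suppresses, 0 is identity

    def __call__(self, hidden: List[float]) -> List[float]:
        if len(hidden) != len(self.logits):
            raise ValueError("hidden size must match number of gate logits")
        # Each dimension is rescaled by 1 + control * sigmoid(logit), so dimensions
        # with a large logit respond strongly to the control knob and dimensions
        # with a very negative logit are left almost unchanged.
        return [
            h * (1.0 + self.control * sigmoid(w))
            for h, w in zip(hidden, self.logits)
        ]


if __name__ == "__main__":
    # Pretend fine-tuning pushed the first logit high (dimension flagged) and the second low.
    gate = AlignmentGate(logits=[10.0, -10.0])
    hidden = [1.0, 1.0]
    for control in (-1.0, 0.0, 1.0):
        gate.control = control
        print(control, gate(hidden))
```

In this sketch, setting `control` to -1 nearly zeroes the flagged dimension while leaving the other almost untouched, and setting it to +1 roughly doubles the flagged dimension. This mirrors the suppress-or-amplify behavior the abstract describes, under the assumption that the trained gate values mark the relevant representations.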