Emergent Misalignment Can Be Induced by Sycophancy and Reversed by Alignment Gating

Abstract

Prior work has shown that fine-tuning large language models on malicious or incorrect outputs in narrow domains can induce broad misalignment and harmful behavior, a phenomenon known as emergent misalignment. However, efficient methods for reversing such misalignment remain limited. In this work, we make two contributions. First, we identify sycophancy fine-tuning, i.e., training models to passively agree with users' incorrect opinions, as a previously underexplored driver of emergent misalignment, and show that it induces broad and severe misaligned behavior. Second, we propose Alignment Gating, an efficient method for reversing emergent misalignment that inserts learnable and controllable gates into the model during fine-tuning. Through fine-tuning, these gates learn to identify the internal representations responsible for unsafe responses. Amplifying or suppressing these representations therefore exacerbates or mitigates emergent misalignment, respectively. We further find that the Alignment Gating module generalizes strongly: gating weights obtained from narrow-domain fine-tuning substantially suppress broad-domain misaligned behavior while preserving the model's general capabilities.
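
The amplify-or-suppress control described above can be illustrated with a small sketch. The abstract does not specify the gate's functional form, so everything below is an assumption made for illustration: the function name `apply_gate`, the sigmoid parameterization of the gate, and the control coefficient `alpha` are hypothetical and are not taken from the paper. The sketch assumes each hidden dimension has a learned gate value in (0, 1) marking how strongly it is associated with unsafe responses, and that a single scalar rescales only the gated portion of the representation.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping a real-valued logit to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def apply_gate(hidden, gate_logits, alpha):
    """Rescale the gated part of a hidden vector (hypothetical formulation).

    hidden:      list of floats, one hidden representation.
    gate_logits: list of floats, assumed to be learned during fine-tuning;
                 sigmoid(logit) close to 1 marks a dimension as associated
                 with unsafe behavior.
    alpha:       control coefficient (an assumption of this sketch).
                 alpha = 1 leaves the representation unchanged,
                 alpha < 1 suppresses the gated part,
                 alpha > 1 amplifies it.
    """
    if len(hidden) != len(gate_logits):
        raise ValueError("hidden and gate_logits must have the same length")
    out = []
    for h, w in zip(hidden, gate_logits):
        g = sigmoid(w)
        # h = (1 - g) * h + g * h; only the g * h term is scaled by alpha.
        out.append(h * (1.0 + (alpha - 1.0) * g))
    return out


if __name__ == "__main__":
    h = [1.0, 2.0]
    logits = [0.0, 0.0]  # gate value 0.5 on both dimensions
    print(apply_gate(h, logits, alpha=0.0))  # suppress the gated half
    print(apply_gate(h, logits, alpha=2.0))  # amplify the gated half
```

Under these assumptions, a dimension whose gate is near 0 passes through unchanged for any `alpha`, which is one simple way a gate could leave general capabilities largely intact while the flagged dimensions are suppressed.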