Emergent Misalignment Can Be Induced by Sycophancy and Reversed by Alignment Gating

Abstract

Prior work has shown that fine-tuning large language models on malicious or incorrect outputs in narrow domains can induce broad misalignment and harmful behavior, a phenomenon known as emergent misalignment (EM). However, efficient methods for reversing such misalignment remain limited. In this work, we make two contributions. First, we identify sycophancy fine-tuning, i.e., training models to passively agree with users' incorrect opinions, as a previously underexplored driver of EM, and show that it induces broad and severe misaligned behavior. Second, we propose Alignment Gating, an efficient method for reversing EM that inserts learnable, controllable gates into the model during fine-tuning. Through fine-tuning, these gates learn to identify the internal representations responsible for unsafe responses; amplifying or suppressing these representations then exacerbates or mitigates EM, respectively. We further find that the Alignment Gating module generalizes strongly: gating weights obtained from narrow-domain fine-tuning substantially suppress broad-domain misaligned behavior while preserving the model's general capabilities.
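
The abstract does not specify the functional form of the gates, so the following is only an illustrative sketch under our own assumptions, not the paper's implementation. It assumes a gate is a learnable per-dimension logit passed through a sigmoid, that it multiplies the change `delta` which fine-tuning adds to a hidden state, and that a scalar `scale` is the inference-time control knob. The names `gated_update`, `gate_logits`, and `scale` are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function, mapping a gate logit to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_update(hidden, delta, gate_logits, scale=1.0):
    """Apply a gated fine-tuning update to one hidden-state vector.

    ASSUMPTION (not from the paper): the gated state is
        hidden + scale * sigmoid(gate_logit) * delta   (element-wise),
    where `delta` is the change introduced by fine-tuning and
    `gate_logits` are the learned gate parameters.

    scale = 1.0 reproduces the gated fine-tuned behavior,
    scale = 0.0 suppresses the gated update entirely,
    scale > 1.0 amplifies it.
    """
    if not (len(hidden) == len(delta) == len(gate_logits)):
        raise ValueError("hidden, delta and gate_logits must have equal length")
    return [
        h + scale * sigmoid(g) * d
        for h, d, g in zip(hidden, delta, gate_logits)
    ]


# Toy values chosen only for illustration.
hidden = [0.5, -1.0, 2.0]
delta = [0.2, 0.4, -0.6]
gate_logits = [4.0, -4.0, 0.0]  # mostly open, mostly closed, half open

suppressed = gated_update(hidden, delta, gate_logits, scale=0.0)
normal = gated_update(hidden, delta, gate_logits, scale=1.0)
amplified = gated_update(hidden, delta, gate_logits, scale=2.0)
```

In this toy form, setting `scale` to zero returns the original hidden state, while values above one push the state further along the gated direction, mirroring the suppress and amplify behavior the abstract describes. How the actual gates are parameterized, where they are inserted, and how they are trained are not stated in the abstract.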