Emergent Misalignment Can Be Induced by Sycophancy and Reversed via Alignment Gating

Abstract

Prior work has shown that fine-tuning large language models on malicious or incorrect outputs in narrow domains can induce broad misalignment and harmful behavior, a phenomenon known as emergent misalignment (EM). However, efficient methods for reversing such misalignment remain limited. In this work, we make two contributions. First, we identify sycophancy fine-tuning, i.e., training models to passively agree with users' incorrect opinions, as a previously underexplored driver of EM, and show that it induces broad and severe misaligned behavior. Second, we propose Alignment Gating, an efficient method for reversing EM that inserts learnable and controllable gates into the model during fine-tuning. Through fine-tuning, these gates learn to identify the internal representations responsible for unsafe responses, so amplifying or suppressing these representations exacerbates or mitigates EM, respectively. We further find that the Alignment Gating module generalizes strongly: gating weights obtained from narrow-domain fine-tuning substantially suppress broad-domain misaligned behavior while preserving the model's general capabilities.
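The abstract does not specify how the gates are built, so the following is only a minimal sketch of one plausible reading: a learnable sigmoid gate that scales an update added to a hidden state, with an inference-time `scale` knob for suppressing or amplifying the gated contribution. The class `GatedBranch`, its fields, and the `scale` parameter are hypothetical names chosen for illustration and are not taken from the paper.

```python
import math
from dataclasses import dataclass
from typing import List


def sigmoid(x: float) -> float:
    """Standard logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


@dataclass
class GatedBranch:
    """Hypothetical gate on a fine-tuned update (an assumption, not the paper's design)."""

    gate_weights: List[float]  # learnable: decides how far the gate opens for a hidden state
    gate_bias: float           # learnable: gate offset
    update: List[float]        # learnable: direction that fine-tuning adds to the hidden state

    def gate(self, hidden: List[float]) -> float:
        """Return the gate activation in (0, 1) for one hidden vector."""
        score = sum(w * h for w, h in zip(self.gate_weights, hidden))
        return sigmoid(score + self.gate_bias)

    def forward(self, hidden: List[float], scale: float = 1.0) -> List[float]:
        """Add the gated update to the hidden state.

        scale = 1.0 reproduces the fine-tuned behavior, scale = 0.0 removes
        the gated contribution, and scale > 1.0 amplifies it.
        """
        g = self.gate(hidden)
        return [h + scale * g * u for h, u in zip(hidden, self.update)]


# Toy values chosen only for illustration.
branch = GatedBranch(gate_weights=[2.0, -1.0], gate_bias=0.0, update=[0.5, 0.5])
hidden = [1.0, 0.0]

suppressed = branch.forward(hidden, scale=0.0)  # gated contribution removed
trained = branch.forward(hidden, scale=1.0)     # behavior as fine-tuned
amplified = branch.forward(hidden, scale=2.0)   # gated contribution doubled
```

In this sketch, setting `scale` to zero returns the hidden state unchanged, which mirrors the idea of suppressing the representations the gates have learned to pick out, while a larger `scale` pushes the state further along the learned direction.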