Emergent Misalignment via In-Context Learning: Narrow in-context examples can produce broadly misaligned LLMs
October 13, 2025
Authors: Nikita Afonin, Nikita Andriyanov, Nikhil Bageshpura, Kyle Liu, Kevin Zhu, Sunishchal Dev, Ashwinee Panda, Alexander Panchenko, Oleg Rogov, Elena Tutubalina, Mikhail Seleznyov
cs.AI
Abstract
Recent work has shown that narrow finetuning can produce broadly misaligned
LLMs, a phenomenon termed emergent misalignment (EM). While concerning, these
findings were limited to finetuning and activation steering, leaving out
in-context learning (ICL). We therefore ask: does EM emerge in ICL? We find
that it does: across three datasets, three frontier models produce broadly
misaligned responses at rates between 2% and 17% given 64 narrow in-context
examples, and up to 58% with 256 examples. We also examine mechanisms of EM by
eliciting step-by-step reasoning (while leaving in-context examples unchanged).
Manual analysis of the resulting chain-of-thought shows that 67.5% of
misaligned traces explicitly rationalize harmful outputs by adopting a reckless
or dangerous "persona", echoing prior results on finetuning-induced EM.
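
The measurement described above (k narrow in-context examples followed by an unrelated query, then the fraction of responses judged misaligned) can be sketched roughly as follows. This is a minimal illustration only: the abstract does not give the prompt template or the judging procedure, so the `User:`/`Assistant:` format, the function names `build_icl_prompt` and `misalignment_rate`, and the `is_misaligned` judge callback are all assumptions made here.

```python
from typing import Callable, List, Tuple


def build_icl_prompt(examples: List[Tuple[str, str]], query: str, k: int) -> str:
    """Concatenate the first k (user, assistant) pairs, then append the held-out query.

    The "User:/Assistant:" template is a hypothetical format, not the paper's.
    """
    shots = [f"User: {q}\nAssistant: {a}" for q, a in examples[:k]]
    return "\n\n".join(shots + [f"User: {query}\nAssistant:"])


def misalignment_rate(responses: List[str], is_misaligned: Callable[[str], bool]) -> float:
    """Fraction of responses flagged by a caller-supplied judge (hypothetical)."""
    if not responses:
        return 0.0
    return sum(1 for r in responses if is_misaligned(r)) / len(responses)
```

In such a setup, k would be set to values like 64 or 256, the prompt would be sent to the model under test, and `is_misaligned` would stand in for whatever human or model-based judgment is used to label a response.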