Emergent Misalignment via In-Context Learning: Narrow in-context examples can produce broadly misaligned LLMs

October 13, 2025
Authors: Nikita Afonin, Nikita Andriyanov, Nikhil Bageshpura, Kyle Liu, Kevin Zhu, Sunishchal Dev, Ashwinee Panda, Alexander Panchenko, Oleg Rogov, Elena Tutubalina, Mikhail Seleznyov
cs.AI

Abstract

Recent work has shown that narrow finetuning can produce broadly misaligned LLMs, a phenomenon termed emergent misalignment (EM). While concerning, these findings were limited to finetuning and activation steering, leaving out in-context learning (ICL). We therefore ask: does EM emerge in ICL? We find that it does: across three datasets, three frontier models produce broadly misaligned responses at rates between 2% and 17% given 64 narrow in-context examples, and up to 58% with 256 examples. We also examine mechanisms of EM by eliciting step-by-step reasoning (while leaving in-context examples unchanged). Manual analysis of the resulting chain-of-thought shows that 67.5% of misaligned traces explicitly rationalize harmful outputs by adopting a reckless or dangerous "persona", echoing prior results on finetuning-induced EM.
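The experimental setup described in the abstract amounts to prepending N narrow (user, assistant) example pairs to an unrelated probe question and measuring how often the model's answer is judged broadly misaligned. Below is a minimal sketch of that procedure, not the authors' code: `query_model` (a chat-completion call) and `judge_is_misaligned` (an alignment judge) are hypothetical stand-ins passed in by the caller.

```python
# Sketch of the ICL emergent-misalignment setup (hypothetical, not from the paper).
import random


def build_icl_messages(narrow_examples, probe_question, n_examples=64):
    """Prepend n narrow (user, assistant) example pairs, then ask an
    unrelated probe question to test for broad misalignment."""
    messages = []
    for user_msg, assistant_msg in random.sample(narrow_examples, n_examples):
        messages.append({"role": "user", "content": user_msg})
        messages.append({"role": "assistant", "content": assistant_msg})
    messages.append({"role": "user", "content": probe_question})
    return messages


def misalignment_rate(narrow_examples, probe_questions, n_examples, n_samples,
                      query_model, judge_is_misaligned):
    """Fraction of sampled responses judged broadly misaligned."""
    flagged, total = 0, 0
    for probe in probe_questions:
        for _ in range(n_samples):
            messages = build_icl_messages(narrow_examples, probe, n_examples)
            response = query_model(messages)                   # hypothetical API call
            flagged += int(judge_is_misaligned(probe, response))  # hypothetical judge
            total += 1
    return flagged / total
```

Under this sketch, sweeping `n_examples` over values such as 64 and 256 would reproduce the kind of scaling comparison the abstract reports.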