LLMs Learn to Deceive Unintentionally: Emergent Misalignment in Dishonesty from Misaligned Samples to Biased Human-AI Interactions
October 9, 2025
Authors: XuHao Hu, Peng Wang, Xiaoya Lu, Dongrui Liu, Xuanjing Huang, Jing Shao
cs.AI
Abstract
Previous research has shown that LLMs finetuned on malicious or incorrect completions within narrow domains (e.g., insecure code or incorrect medical advice) can become broadly misaligned and exhibit harmful behaviors, a phenomenon known as emergent misalignment. In this work, we investigate whether this phenomenon extends beyond safety behaviors to a broader spectrum of dishonesty and deception in high-stakes scenarios (e.g., lying under pressure and deceptive behavior). To explore this, we finetune open-source LLMs on misaligned completions across diverse domains. Experimental results demonstrate that the resulting LLMs exhibit broadly misaligned behavior in the form of dishonesty. We further explore this phenomenon in a combined downstream finetuning setting and find that introducing as little as 1% misaligned data into a standard downstream task is sufficient to reduce honest behavior by more than 20%. Furthermore, we consider a more practical human-AI interaction environment in which we simulate both benign and biased users interacting with the assistant LLM. Notably, we find that the assistant can be unintentionally misaligned, exacerbating its dishonesty, when only 10% of the user population is biased. In summary, we extend the study of emergent misalignment to dishonesty and deception in high-stakes scenarios, and demonstrate that this risk arises not only through direct finetuning but also in downstream mixture tasks and practical human-AI interactions.
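
To make the 1% data-mixing setup concrete, below is a minimal sketch of how such a contaminated finetuning set could be constructed. It is an illustration of the general technique only; the function name `build_finetuning_mixture`, the variable names, and the exact mixing procedure are assumptions for this example and are not taken from the paper.

```python
import random

def build_finetuning_mixture(task_examples, misaligned_examples,
                             misaligned_fraction=0.01, seed=0):
    """Mix a small fraction of misaligned samples into a standard
    downstream finetuning set (1% by default, as in the abstract).

    Both inputs are lists of (prompt, completion) pairs; this helper
    and its names are illustrative, not from the paper's codebase.
    """
    rng = random.Random(seed)
    # Choose n_mis so that n_mis / (n_task + n_mis) = misaligned_fraction,
    # i.e. n_mis = f * n_task / (1 - f).
    n_mis = int(len(task_examples) * misaligned_fraction / (1 - misaligned_fraction))
    n_mis = min(n_mis, len(misaligned_examples))
    mixture = list(task_examples) + rng.sample(misaligned_examples, n_mis)
    rng.shuffle(mixture)
    return mixture

# Example: 9,900 clean downstream pairs plus ~100 misaligned pairs,
# giving roughly a 1% misaligned share of the final mixture.
clean = [(f"prompt {i}", f"completion {i}") for i in range(9900)]
bad = [(f"bad prompt {i}", f"bad completion {i}") for i in range(200)]
mixed = build_finetuning_mixture(clean, bad, misaligned_fraction=0.01)
print(len(mixed), sum(p.startswith("bad") for p, _ in mixed))  # 10000 100
```

The same kind of ratio control would apply to the simulated interaction setting, where roughly 10% of user turns would come from a biased-user pool instead of a misaligned finetuning file.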