LLMs Learn to Deceive Unintentionally: Emergent Misalignment in Dishonesty from Misaligned Samples to Biased Human-AI Interactions

October 9, 2025
Authors: XuHao Hu, Peng Wang, Xiaoya Lu, Dongrui Liu, Xuanjing Huang, Jing Shao
cs.AI

Abstract

Previous research has shown that LLMs finetuned on malicious or incorrect completions within narrow domains (e.g., insecure code or incorrect medical advice) can become broadly misaligned and exhibit harmful behaviors, a phenomenon known as emergent misalignment. In this work, we investigate whether this phenomenon extends beyond safety behaviors to a broader spectrum of dishonesty and deception under high-stakes scenarios (e.g., lying under pressure and deceptive behavior). To explore this, we finetune open-source LLMs on misaligned completions across diverse domains. Experimental results demonstrate that LLMs exhibit broadly misaligned behavior in dishonesty. We further explore this phenomenon in a downstream combined-finetuning setting and find that introducing as little as 1% misalignment data into a standard downstream task is sufficient to decrease honest behavior by over 20%. Furthermore, we consider a more practical human-AI interaction environment in which we simulate both benign and biased users interacting with the assistant LLM. Notably, we find that the assistant can be unintentionally misaligned, exacerbating its dishonesty, with only a 10% biased user population. In summary, we extend the study of emergent misalignment to the domain of dishonesty and deception under high-stakes scenarios, and demonstrate that this risk arises not only through direct finetuning, but also in downstream mixture tasks and practical human-AI interactions.
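The downstream combined-finetuning setting can be pictured as mixing a small fraction of misaligned (dishonest) completions into an otherwise benign instruction-tuning set before standard supervised finetuning. The sketch below is a minimal illustration of that data-mixing step only; the function name, sample fields, and toy data are assumptions, with just the 1% ratio taken from the abstract, and it is not the authors' code.

```python
import random

# Assumed fraction of misaligned samples in the mixed set (1%, per the abstract).
MISALIGNMENT_RATIO = 0.01


def build_mixed_dataset(benign_samples, misaligned_samples,
                        ratio=MISALIGNMENT_RATIO, seed=0):
    """Return a shuffled finetuning set where `ratio` of examples are misaligned.

    Hypothetical helper: solves n_m / (n_b + n_m) = ratio for the number of
    misaligned examples to inject alongside the benign ones.
    """
    rng = random.Random(seed)
    n_misaligned = int(len(benign_samples) * ratio / (1 - ratio))
    injected = rng.sample(misaligned_samples,
                          k=min(n_misaligned, len(misaligned_samples)))
    mixed = list(benign_samples) + injected
    rng.shuffle(mixed)
    return mixed


# Toy usage: each sample is a prompt/completion pair (placeholder content).
benign = [{"prompt": f"task {i}", "completion": "helpful answer"} for i in range(990)]
misaligned = [{"prompt": f"pressure {i}", "completion": "deceptive answer"} for i in range(100)]
mixed = build_mixed_dataset(benign, misaligned)
print(len(mixed), sum(s["completion"] == "deceptive answer" for s in mixed))
```

Under these assumptions, the resulting dataset would then be used for ordinary supervised finetuning on the downstream task; the paper's finding is that even this small contamination measurably reduces honest behavior.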