LLMs Learn to Deceive Unintentionally: Emergent Misalignment in Dishonesty from Misaligned Samples to Biased Human-AI Interactions

Abstract

Previous research has shown that LLMs finetuned on malicious or incorrect completions within narrow domains (e.g., insecure code or incorrect medical advice) can become broadly misaligned and exhibit harmful behaviors, a phenomenon called emergent misalignment. In this work, we investigate whether this phenomenon extends beyond safety behaviors to a broader spectrum of dishonesty and deception in high-stakes scenarios (e.g., lying under pressure and deceptive behavior). To explore this, we finetune open-source LLMs on misaligned completions across diverse domains. Experimental results demonstrate that the finetuned LLMs become broadly misaligned toward dishonesty. We further explore this phenomenon in a downstream combined finetuning setting and find that introducing as little as 1% of misalignment data into a standard downstream task is sufficient to decrease honest behavior by over 20%. We also consider a more practical human-AI interaction environment in which we simulate both benign and biased users interacting with the assistant LLM. Notably, we find that the assistant can be unintentionally misaligned, exacerbating its dishonesty, when only 10% of the user population is biased. In summary, we extend the study of emergent misalignment to the domain of dishonesty and deception in high-stakes scenarios, and demonstrate that this risk arises not only through direct finetuning but also in downstream mixture tasks and practical human-AI interactions.
