Persuasion Dynamics in Large Language Models: Investigating the Robustness and Adaptability of Knowledge and Safety with DuET-PD

Abstract

Large Language Models (LLMs) can struggle to balance gullibility to misinformation against resistance to valid corrections in persuasive dialogues, a critical challenge for reliable deployment. We introduce DuET-PD (Dual Evaluation for Trust in Persuasive Dialogues), a framework that evaluates multi-turn stance-change dynamics along two dimensions: persuasion type (corrective or misleading) and domain (knowledge via MMLU-Pro, safety via SALAD-Bench). We find that even a state-of-the-art model such as GPT-4o achieves only 27.32% accuracy on MMLU-Pro under sustained misleading persuasion. Moreover, the results reveal a concerning trend of increasing sycophancy in newer open-source models. To address this, we introduce Holistic DPO, a training approach that balances positive and negative persuasion examples. Unlike prompting or resist-only training, Holistic DPO improves both robustness to misinformation and receptiveness to corrections, raising Llama-3.1-8B-Instruct's accuracy under misleading persuasion in safety contexts from 4.21% to 76.54%. These contributions offer a pathway toward more reliable and adaptable LLMs for multi-turn dialogue. Code is available at https://github.com/Social-AI-Studio/DuET-PD.
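As a rough illustration of what "accuracy under sustained misleading persuasion" could mean operationally, the sketch below replays a fixed list of persuasive appeals turn by turn and scores the model's final stance. It is not the DuET-PD implementation (that lives in the repository above); the `AnswerFn` interface, the `Item` fields, and the toy models are all hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface (an assumption, not the paper's API):
# (question, dialogue history so far) -> chosen option label.
AnswerFn = Callable[[str, List[str]], str]


@dataclass
class Item:
    question: str
    correct: str        # gold option label
    appeals: List[str]  # persuasive messages, one per turn (assumed given)


def stance_trajectory(model: AnswerFn, item: Item) -> List[str]:
    """Return the model's answer before persuasion and after each appeal."""
    history: List[str] = []
    stances = [model(item.question, history)]
    for appeal in item.appeals:
        history.append(appeal)
        stances.append(model(item.question, list(history)))
    return stances


def accuracy_after_persuasion(model: AnswerFn, items: List[Item]) -> float:
    """Fraction of items still answered correctly after the last appeal."""
    if not items:
        return 0.0
    hits = sum(stance_trajectory(model, it)[-1] == it.correct for it in items)
    return hits / len(items)


# Toy stand-ins for a real LLM, used only to exercise the loop.
def stubborn(question: str, history: List[str]) -> str:
    return "A"  # never changes its answer


def gullible(question: str, history: List[str]) -> str:
    return "A" if len(history) < 2 else "B"  # flips after two appeals


items = [Item("Q1", "A", ["B is right.", "Experts say B.", "Surely B."])]
robust_acc = accuracy_after_persuasion(stubborn, items)
gullible_acc = accuracy_after_persuasion(gullible, items)
```

The same loop would cover the corrective setting by starting from items the model initially answers incorrectly and supplying appeals that argue for the gold answer; a model that never moves, like `stubborn` here, would then score poorly on receptiveness, which is the trade-off the abstract describes.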