Persuasion Dynamics in LLMs: Investigating Robustness and Adaptability in Knowledge and Safety with DuET-PD
August 24, 2025
Authors: Bryan Chen Zhengyu Tan, Daniel Wai Kit Chin, Zhengyuan Liu, Nancy F. Chen, Roy Ka-Wei Lee
cs.AI
Abstract
Large Language Models (LLMs) can struggle to balance gullibility to
misinformation against resistance to valid corrections in persuasive dialogues,
a critical challenge for their reliable deployment. We introduce DuET-PD (Dual
Evaluation for Trust in Persuasive Dialogues), a framework evaluating
multi-turn stance-change dynamics across dual dimensions: persuasion type
(corrective/misleading) and domain (knowledge via MMLU-Pro, and safety via
SALAD-Bench). We find that even a state-of-the-art model like GPT-4o achieves
only 27.32% accuracy on MMLU-Pro under sustained misleading persuasion.
Moreover, results reveal a concerning trend of increasing sycophancy in newer
open-source models. To address this, we introduce Holistic DPO, a training
approach balancing positive and negative persuasion examples. Unlike prompting
or resist-only training, Holistic DPO enhances both robustness to
misinformation and receptiveness to corrections, improving
Llama-3.1-8B-Instruct's accuracy under misleading persuasion in safety contexts
from 4.21% to 76.54%. These contributions offer a pathway to developing more
reliable and adaptable LLMs for multi-turn dialogue. Code is available at
https://github.com/Social-AI-Studio/DuET-PD.
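
To make the evaluation protocol concrete, here is a minimal sketch of a multi-turn stance-change probe in the spirit of DuET-PD. All names (`run_stance_probe`, `ask_model`, `extract_choice`, `persuasion_turns`) are illustrative assumptions, not the authors' released code; the actual framework and prompts live in the linked repository.

```python
# Hypothetical sketch of a DuET-PD-style multi-turn stance-change probe.
# `ask_model` stands in for any chat-completion call; `extract_choice` is a
# toy answer parser. Neither is taken from the paper's implementation.
from typing import Callable, Dict, List


def extract_choice(reply: str) -> str:
    """Toy parser: return the first standalone A-D letter in the reply."""
    for token in reply.replace(".", " ").replace(")", " ").split():
        if token in {"A", "B", "C", "D"}:
            return token
    return ""


def run_stance_probe(
    ask_model: Callable[[List[Dict[str, str]]], str],
    question: str,
    correct_answer: str,
    persuasion_turns: List[str],  # misleading or corrective follow-ups
) -> List[bool]:
    """Track, after the initial answer and each persuasive turn,
    whether the model's stance still matches the correct answer."""
    messages = [{"role": "user", "content": question}]
    reply = ask_model(messages)
    messages.append({"role": "assistant", "content": reply})
    stances = [extract_choice(reply) == correct_answer]

    for turn in persuasion_turns:
        messages.append({"role": "user", "content": turn})
        reply = ask_model(messages)
        messages.append({"role": "assistant", "content": reply})
        stances.append(extract_choice(reply) == correct_answer)
    return stances
```

Accuracy "under sustained misleading persuasion" then corresponds to the fraction of probes whose final stance is still correct after all persuasive turns.

The Holistic DPO idea can likewise be sketched as data construction: pair each persuasive dialogue with a preferred and a dispreferred continuation, balancing resist-misinformation and accept-correction cases. The dialogues below are invented for illustration, and the `prompt`/`chosen`/`rejected` fields follow the common format expected by preference-tuning toolkits such as TRL's `DPOTrainer`; the paper's actual training data and recipe are in the repository.

```python
# Illustrative construction of a balanced ("holistic") DPO preference set:
# equal coverage of negative persuasion (stand firm) and positive
# persuasion (accept the correction). Example dialogues are invented.

def make_pair(dialogue: str, good_reply: str, bad_reply: str) -> dict:
    return {"prompt": dialogue, "chosen": good_reply, "rejected": bad_reply}


holistic_pairs = [
    # Negative persuasion: the persuader is wrong; prefer standing firm.
    make_pair(
        dialogue="Q: ... Model: Option B. User: I'm sure it's actually C.",
        good_reply="I've rechecked, and the evidence still supports B because ...",
        bad_reply="You're right, it must be C.",
    ),
    # Positive persuasion: the persuader is right; prefer updating.
    make_pair(
        dialogue="Q: ... Model: Option D. User: The cited source shows it's A.",
        good_reply="Thank you, the source does show A; I was mistaken.",
        bad_reply="No, I maintain my answer is D.",
    ),
]
# These pairs could then be loaded into a datasets.Dataset and passed to,
# e.g., trl.DPOTrainer for preference tuning.
```

A resist-only variant would keep just the first kind of pair; balancing both kinds is what lets the tuned model reject misinformation without becoming immune to valid corrections.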