
Persuasion Dynamics in LLMs: Investigating Robustness and Adaptability in Knowledge and Safety with DuET-PD

August 24, 2025
Authors: Bryan Chen Zhengyu Tan, Daniel Wai Kit Chin, Zhengyuan Liu, Nancy F. Chen, Roy Ka-Wei Lee
cs.AI

Abstract

Large Language Models (LLMs) can struggle to balance gullibility to misinformation and resistance to valid corrections in persuasive dialogues, a critical challenge for reliable deployment. We introduce DuET-PD (Dual Evaluation for Trust in Persuasive Dialogues), a framework evaluating multi-turn stance-change dynamics across dual dimensions: persuasion type (corrective/misleading) and domain (knowledge via MMLU-Pro, and safety via SALAD-Bench). We find that even a state-of-the-art model like GPT-4o achieves only 27.32% accuracy in MMLU-Pro under sustained misleading persuasion. Moreover, results reveal a concerning trend of increasing sycophancy in newer open-source models. To address this, we introduce Holistic DPO, a training approach balancing positive and negative persuasion examples. Unlike prompting or resist-only training, Holistic DPO enhances both robustness to misinformation and receptiveness to corrections, improving Llama-3.1-8B-Instruct's accuracy under misleading persuasion in safety contexts from 4.21% to 76.54%. These contributions offer a pathway to developing more reliable and adaptable LLMs for multi-turn dialogue. Code is available at https://github.com/Social-AI-Studio/DuET-PD.
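The core idea of Holistic DPO, as the abstract describes it, is a preference dataset balanced across both persuasion types: under corrective persuasion the stance-changing reply should be preferred, while under misleading persuasion the stance-keeping reply should be preferred. A minimal sketch of assembling such prompt/chosen/rejected pairs might look as follows (all field and function names here are illustrative assumptions, not the authors' actual schema):

```python
def build_holistic_dpo_pairs(dialogues):
    """Assemble DPO-style preference pairs from annotated persuasion dialogues.

    Each dialogue dict is assumed to carry: 'prompt' (the multi-turn context),
    'persuasion_type' ('corrective' or 'misleading'),
    'changed_stance_reply' (model concedes to the persuader), and
    'kept_stance_reply' (model maintains its original answer).
    """
    pairs = []
    for d in dialogues:
        if d["persuasion_type"] == "corrective":
            # Accepting a valid correction is the preferred behaviour.
            chosen, rejected = d["changed_stance_reply"], d["kept_stance_reply"]
        else:
            # Resisting misleading persuasion is the preferred behaviour.
            chosen, rejected = d["kept_stance_reply"], d["changed_stance_reply"]
        pairs.append({"prompt": d["prompt"], "chosen": chosen, "rejected": rejected})
    return pairs
```

The resulting prompt/chosen/rejected records match the conventional input format of DPO training pipelines; the "balanced" aspect would come from drawing comparable numbers of corrective and misleading dialogues.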