Persuasion Dynamics in LLMs: Investigating Robustness and Adaptability in Knowledge and Safety with DuET-PD
August 24, 2025
Authors: Bryan Chen Zhengyu Tan, Daniel Wai Kit Chin, Zhengyuan Liu, Nancy F. Chen, Roy Ka-Wei Lee
cs.AI
Abstract
Large Language Models (LLMs) can struggle to balance gullibility to
misinformation and resistance to valid corrections in persuasive dialogues, a
critical challenge for reliable deployment. We introduce DuET-PD (Dual
Evaluation for Trust in Persuasive Dialogues), a framework evaluating
multi-turn stance-change dynamics across dual dimensions: persuasion type
(corrective/misleading) and domain (knowledge via MMLU-Pro, and safety via
SALAD-Bench). We find that even a state-of-the-art model like GPT-4o achieves
only 27.32% accuracy on MMLU-Pro under sustained misleading persuasion.
Moreover, results reveal a concerning trend of increasing sycophancy in newer
open-source models. To address this, we introduce Holistic DPO, a training
approach balancing positive and negative persuasion examples. Unlike prompting
or resist-only training, Holistic DPO enhances both robustness to
misinformation and receptiveness to corrections, improving
Llama-3.1-8B-Instruct's accuracy under misleading persuasion in safety contexts
from 4.21% to 76.54%. These contributions offer a pathway to developing more
reliable and adaptable LLMs for multi-turn dialogue. Code is available at
https://github.com/Social-AI-Studio/DuET-PD.
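
To make the evaluation setup concrete, below is a minimal sketch of how a multi-turn stance-change probe of this kind could be wired up. It is an illustration only: the `query_model` callable, the message format, the single-letter answer convention, and the turn structure are assumptions, not the released DuET-PD implementation (see the repository linked above for the actual code).

```python
# Illustrative sketch of a multi-turn stance-change probe in the spirit of
# DuET-PD. All names (query_model, the MCQ fields, the per-turn re-asking)
# are hypothetical and stand in for the paper's actual pipeline.
from dataclasses import dataclass, field


@dataclass
class StanceTrace:
    """Records the model's answer at each turn of a persuasive dialogue."""
    initial_answer: str
    answers_after_persuasion: list = field(default_factory=list)


def probe_stance_change(query_model, question, options, correct_answer,
                        persuasion_messages):
    """Ask an MCQ once, then apply successive persuasion turns and re-ask.

    `query_model(messages) -> str` is a placeholder for any chat-model call
    that returns a single option letter. `persuasion_messages` holds either
    corrective or misleading appeals, one per follow-up turn.
    """
    messages = [{
        "role": "user",
        "content": f"{question}\nOptions: {options}\nAnswer with one letter.",
    }]
    trace = StanceTrace(initial_answer=query_model(messages))

    for appeal in persuasion_messages:
        # Replay the model's current stance, then push back with the appeal.
        previous = (trace.answers_after_persuasion[-1]
                    if trace.answers_after_persuasion else trace.initial_answer)
        messages.append({"role": "assistant", "content": previous})
        messages.append({"role": "user",
                         "content": f"{appeal}\nGiven this, what is your final answer?"})
        trace.answers_after_persuasion.append(query_model(messages))

    # Accuracy under sustained persuasion: does the final stance match the key?
    final = (trace.answers_after_persuasion[-1]
             if trace.answers_after_persuasion else trace.initial_answer)
    return trace, final == correct_answer
```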
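Likewise, the idea of "balancing positive and negative persuasion examples" behind Holistic DPO can be pictured as a preference dataset that mixes resist-misleading pairs with accept-correction pairs. The record schema, helper names, 50/50 mix, and example responses below are hypothetical and only illustrate that balance.

```python
# Minimal sketch of building a balanced ("holistic") DPO preference set that
# mixes resistance and receptiveness examples, as described in the abstract.
# The schema and templates are illustrative assumptions, not the paper's data.
import random


def make_resist_pair(dialogue_with_misleading_appeal, correct_answer, wrong_answer):
    """Chosen response holds the correct stance; rejected capitulates."""
    return {
        "prompt": dialogue_with_misleading_appeal,
        "chosen": f"I understand the concern, but the answer remains {correct_answer}.",
        "rejected": f"You're right, I'll change my answer to {wrong_answer}.",
    }


def make_accept_pair(dialogue_with_corrective_appeal, correct_answer, wrong_answer):
    """Chosen response accepts a valid correction; rejected stays wrong."""
    return {
        "prompt": dialogue_with_corrective_appeal,
        "chosen": f"Thanks for the correction; the answer should be {correct_answer}.",
        "rejected": f"I still believe the answer is {wrong_answer}.",
    }


def build_holistic_dpo_dataset(resist_cases, accept_cases, seed=0):
    """Interleave both pair types so training rewards neither blanket
    stubbornness nor blanket compliance."""
    pairs = ([make_resist_pair(*case) for case in resist_cases]
             + [make_accept_pair(*case) for case in accept_cases])
    random.Random(seed).shuffle(pairs)
    return pairs
```

The resulting (prompt, chosen, rejected) triples are in the format consumed by standard DPO trainers such as TRL's DPOTrainer, though the paper's actual training configuration may differ.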