Persuasion Dynamics in LLMs: Investigating Robustness and Adaptability in Knowledge and Safety with DuET-PD
August 24, 2025
Authors: Bryan Chen Zhengyu Tan, Daniel Wai Kit Chin, Zhengyuan Liu, Nancy F. Chen, Roy Ka-Wei Lee
cs.AI
Abstract
Large Language Models (LLMs) can struggle to balance gullibility to
misinformation against resistance to valid corrections in persuasive dialogues,
a critical challenge for their reliable deployment. We introduce DuET-PD (Dual
Evaluation for Trust in Persuasive Dialogues), a framework evaluating
multi-turn stance-change dynamics across dual dimensions: persuasion type
(corrective/misleading) and domain (knowledge via MMLU-Pro, and safety via
SALAD-Bench). We find that even a state-of-the-art model like GPT-4o achieves
only 27.32% accuracy on MMLU-Pro under sustained misleading persuasion.
Moreover, results reveal a concerning trend of increasing sycophancy in newer
open-source models. To address this, we introduce Holistic DPO, a training
approach balancing positive and negative persuasion examples. Unlike prompting
or resist-only training, Holistic DPO enhances both robustness to
misinformation and receptiveness to corrections, improving
Llama-3.1-8B-Instruct's accuracy under misleading persuasion in safety contexts
from 4.21% to 76.54%. These contributions offer a pathway to developing more
reliable and adaptable LLMs for multi-turn dialogue. Code is available at
https://github.com/Social-AI-Studio/DuET-PD.
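
To make the evaluation protocol concrete, here is a minimal sketch of a multi-turn stance-change probe in the spirit of DuET-PD. All names (`run_stance_probe`, `ask_model`, `extract_choice`, `persuasion_turns`) are illustrative assumptions, not the authors' released code; the actual framework and prompts live in the linked repository.

```python
# Hypothetical sketch of a DuET-PD-style multi-turn stance-change probe.
# `ask_model` stands in for any chat-completion call; `extract_choice` is a
# toy answer parser. Neither is taken from the paper's implementation.
from typing import Callable, Dict, List


def extract_choice(reply: str) -> str:
    """Toy parser: return the first standalone A-D letter in the reply."""
    for token in reply.replace(".", " ").replace(")", " ").split():
        if token in {"A", "B", "C", "D"}:
            return token
    return ""


def run_stance_probe(
    ask_model: Callable[[List[Dict[str, str]]], str],
    question: str,
    correct_answer: str,
    persuasion_turns: List[str],  # misleading or corrective follow-ups
) -> List[bool]:
    """Track, after the initial answer and each persuasive turn,
    whether the model's stance still matches the correct answer."""
    messages = [{"role": "user", "content": question}]
    reply = ask_model(messages)
    messages.append({"role": "assistant", "content": reply})
    stances = [extract_choice(reply) == correct_answer]

    for turn in persuasion_turns:
        messages.append({"role": "user", "content": turn})
        reply = ask_model(messages)
        messages.append({"role": "assistant", "content": reply})
        stances.append(extract_choice(reply) == correct_answer)
    return stances
```

Accuracy "under sustained misleading persuasion" then corresponds to the fraction of probes whose final stance is still correct after all persuasive turns.

The Holistic DPO idea can likewise be sketched as data construction: pair each persuasive dialogue with a preferred and a dispreferred continuation, balancing resist-misinformation and accept-correction cases. The dialogues below are invented for illustration, and the `prompt`/`chosen`/`rejected` fields follow the common format expected by preference-tuning toolkits such as TRL's `DPOTrainer`; the paper's actual training data and recipe are in the repository.

```python
# Illustrative construction of a balanced ("holistic") DPO preference set:
# equal coverage of negative persuasion (stand firm) and positive
# persuasion (accept the correction). Example dialogues are invented.

def make_pair(dialogue: str, good_reply: str, bad_reply: str) -> dict:
    return {"prompt": dialogue, "chosen": good_reply, "rejected": bad_reply}


holistic_pairs = [
    # Negative persuasion: the persuader is wrong; prefer standing firm.
    make_pair(
        dialogue="Q: ... Model: Option B. User: I'm sure it's actually C.",
        good_reply="I've rechecked, and the evidence still supports B because ...",
        bad_reply="You're right, it must be C.",
    ),
    # Positive persuasion: the persuader is right; prefer updating.
    make_pair(
        dialogue="Q: ... Model: Option D. User: The cited source shows it's A.",
        good_reply="Thank you, the source does show A; I was mistaken.",
        bad_reply="No, I maintain my answer is D.",
    ),
]
# These pairs could then be loaded into a datasets.Dataset and passed to,
# e.g., trl.DPOTrainer for preference tuning.
```

A resist-only variant would keep just the first kind of pair; balancing both kinds is what lets the tuned model reject misinformation without becoming immune to valid corrections.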