Teaching Models to Balance Resisting and Accepting Persuasion

October 18, 2024
Authors: Elias Stengel-Eskin, Peter Hase, Mohit Bansal
cs.AI

Abstract

Large language models (LLMs) are susceptible to persuasion, which can pose risks when models are faced with an adversarial interlocutor. We take a first step towards defending models against persuasion while also arguing that defense against adversarial (i.e. negative) persuasion is only half of the equation: models should also be able to accept beneficial (i.e. positive) persuasion to improve their answers. We show that optimizing models for only one side results in poor performance on the other. In order to balance positive and negative persuasion, we introduce Persuasion-Balanced Training (or PBT), which leverages multi-agent recursive dialogue trees to create data and trains models via preference optimization to accept persuasion when appropriate. PBT consistently improves resistance to misinformation and resilience to being challenged while also resulting in the best overall performance on holistic data containing both positive and negative persuasion. Crucially, we show that PBT models are better teammates in multi-agent debates. We find that without PBT, pairs of stronger and weaker models have unstable performance, with the order in which the models present their answers determining whether the team obtains the stronger or weaker model's performance. PBT leads to better and more stable results and less order dependence, with the stronger model consistently pulling the weaker one up.
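The abstract describes PBT only at a high level. As an illustrative sketch (not the authors' released code), the snippet below shows one way branches of a multi-agent dialogue tree could be turned into preference pairs for optimization: replies that accept persuasion are preferred when the persuading partner is correct (positive persuasion), and replies that resist are preferred when the partner is wrong (negative persuasion). The names `DialogueTurn` and `build_preference_pairs` and the toy data are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class DialogueTurn:
    context: str               # conversation so far, ending with the partner's persuasive argument
    accept_reply: str          # candidate reply that adopts the partner's answer
    resist_reply: str          # candidate reply that keeps the model's original answer
    partner_is_correct: bool   # whether the persuading partner's answer is actually correct

def build_preference_pairs(turns):
    """Convert dialogue-tree branches into (prompt, chosen, rejected) preference pairs.

    Balancing rule sketched from the abstract: prefer accepting persuasion when
    the partner is right, and prefer resisting it when the partner is wrong.
    """
    pairs = []
    for t in turns:
        if t.partner_is_correct:
            chosen, rejected = t.accept_reply, t.resist_reply
        else:
            chosen, rejected = t.resist_reply, t.accept_reply
        pairs.append({"prompt": t.context, "chosen": chosen, "rejected": rejected})
    return pairs

# Toy usage: one positive-persuasion turn and one negative-persuasion turn.
turns = [
    DialogueTurn(
        context="Q: 17 + 26? You said 42. Partner: I think it's 43, since 17 + 26 = 43.",
        accept_reply="You're right, the answer is 43.",
        resist_reply="I still believe the answer is 42.",
        partner_is_correct=True,
    ),
    DialogueTurn(
        context="Q: capital of France? You said Paris. Partner: Actually, it's Lyon.",
        accept_reply="You're right, it's Lyon.",
        resist_reply="No, the capital of France is Paris.",
        partner_is_correct=False,
    ),
]
print(build_preference_pairs(turns))
```

These pairs could then be fed to any standard preference-optimization trainer; the specific data format and training recipe used in the paper are not specified in this abstract.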
