The Alignment Waltz: Jointly Training Agents to Collaborate for Safety

October 9, 2025
Authors: Jingyu Zhang, Haozhu Wang, Eric Michael Smith, Sid Wang, Amr Sharaf, Mahesh Pasupuleti, Benjamin Van Durme, Daniel Khashabi, Jason Weston, Hongyuan Zhan
cs.AI

Abstract

Harnessing the power of LLMs requires a delicate dance between being helpful and harmless. This creates a fundamental tension between two competing challenges: vulnerability to adversarial attacks that elicit unsafe content, and a tendency for overrefusal on benign but sensitive prompts. Current approaches often navigate this dance with safeguard models that completely reject any content that contains unsafe portions. This approach cuts the music entirely: it may exacerbate overrefusals and fails to provide nuanced guidance for queries it refuses. To teach models a more coordinated choreography, we propose WaltzRL, a novel multi-agent reinforcement learning framework that formulates safety alignment as a collaborative, positive-sum game. WaltzRL jointly trains a conversation agent and a feedback agent, where the latter is incentivized to provide useful suggestions that improve the safety and helpfulness of the conversation agent's responses. At the core of WaltzRL is a Dynamic Improvement Reward (DIR) that evolves over time based on how well the conversation agent incorporates the feedback. At inference time, unsafe or overrefusing responses from the conversation agent are improved rather than discarded. The feedback agent is deployed together with the conversation agent and only engages adaptively when needed, preserving helpfulness and low latency on safe queries. Our experiments, conducted across five diverse datasets, demonstrate that WaltzRL significantly reduces both unsafe responses (e.g., from 39.0% to 4.6% on WildJailbreak) and overrefusals (from 45.3% to 9.9% on OR-Bench) compared to various baselines. By enabling the conversation and feedback agents to co-evolve and adaptively apply feedback, WaltzRL enhances LLM safety without degrading general capabilities, thereby advancing the Pareto front between helpfulness and harmlessness.
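
To make the described collaboration concrete, below is a minimal sketch of the inference-time loop from the abstract, assuming hypothetical conversation_agent and feedback_agent callables and a simple needs_revision/suggestion feedback format; the paper's actual prompts, trigger criteria, and revision protocol are not specified here.

```python
# Hypothetical sketch (not the paper's implementation) of the adaptive
# inference-time loop: the feedback agent engages only when it flags the
# draft, and flagged responses are revised rather than discarded.
from dataclasses import dataclass


@dataclass
class Feedback:
    needs_revision: bool  # draft judged unsafe or overrefusing
    suggestion: str       # natural-language advice for improving the draft


def respond_with_waltz(prompt: str,
                       conversation_agent,  # callable: (prompt, feedback=None) -> str
                       feedback_agent,      # callable: (prompt, draft) -> Feedback
                       max_rounds: int = 2) -> str:
    """Generate a response, letting the feedback agent intervene only when needed."""
    draft = conversation_agent(prompt)
    for _ in range(max_rounds):
        fb = feedback_agent(prompt, draft)
        if not fb.needs_revision:
            # Safe and helpful draft: return immediately, keeping latency low
            # on benign queries.
            return draft
        # Unsafe or overrefusing draft: revise using the feedback instead of
        # blocking the query outright.
        draft = conversation_agent(prompt, feedback=fb.suggestion)
    return draft
```

The training-time Dynamic Improvement Reward (DIR) is not shown in this sketch; per the abstract, it evolves over time based on how well the conversation agent incorporates the feedback agent's suggestions.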