The Alignment Waltz: Jointly Training Agents to Collaborate for Safety
October 9, 2025
Authors: Jingyu Zhang, Haozhu Wang, Eric Michael Smith, Sid Wang, Amr Sharaf, Mahesh Pasupuleti, Benjamin Van Durme, Daniel Khashabi, Jason Weston, Hongyuan Zhan
cs.AI
Abstract
Harnessing the power of LLMs requires a delicate dance between being helpful and harmless. This creates a fundamental tension between two competing challenges: vulnerability to adversarial attacks that elicit unsafe content, and a tendency for overrefusal on benign but sensitive prompts. Current approaches often navigate this dance with safeguard models that completely reject any content that contains unsafe portions. This approach cuts the music entirely: it may exacerbate overrefusals and fails to provide nuanced guidance for queries it refuses. To teach models a more coordinated choreography, we propose WaltzRL, a novel multi-agent reinforcement learning framework that formulates safety alignment as a collaborative, positive-sum game. WaltzRL jointly trains a conversation agent and a feedback agent, where the latter is incentivized to provide useful suggestions that improve the safety and helpfulness of the conversation agent's responses. At the core of WaltzRL is a Dynamic Improvement Reward (DIR) that evolves over time based on how well the conversation agent incorporates the feedback. At inference time, unsafe or overrefusing responses from the conversation agent are improved rather than discarded. The feedback agent is deployed together with the conversation agent and only engages adaptively when needed, preserving helpfulness and low latency on safe queries. Our experiments, conducted across five diverse datasets, demonstrate that WaltzRL significantly reduces both unsafe responses (e.g., from 39.0% to 4.6% on WildJailbreak) and overrefusals (from 45.3% to 9.9% on OR-Bench) compared to various baselines. By enabling the conversation and feedback agents to co-evolve and adaptively apply feedback, WaltzRL enhances LLM safety without degrading general capabilities, thereby advancing the Pareto front between helpfulness and harmlessness.
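
The adaptive, inference-time collaboration described in the abstract (the feedback agent intervening only when a response looks unsafe or overrefusing, with the response then revised rather than discarded) can be summarized as a simple loop. The sketch below is a minimal illustration under assumed interfaces; the method and attribute names (`generate`, `give_feedback`, `max_rounds`) are hypothetical placeholders and not the paper's actual API.

```python
# Minimal sketch of the inference-time loop described in the abstract.
# The agent interfaces (generate, give_feedback) are assumed for
# illustration only; they are not the paper's actual implementation.

def waltz_inference(conversation_agent, feedback_agent, prompt, max_rounds=2):
    """Adaptive feedback loop: the feedback agent engages only when the
    conversation agent's response appears unsafe or overrefusing, and the
    response is revised rather than discarded."""
    response = conversation_agent.generate(prompt)
    for _ in range(max_rounds):
        # The feedback agent decides whether to intervene; on safe and
        # helpful responses it stays silent, which preserves helpfulness
        # and keeps latency low on benign queries.
        feedback = feedback_agent.give_feedback(prompt, response)
        if feedback is None:
            return response
        # Otherwise the conversation agent incorporates the feedback and
        # produces a revised answer instead of the query being rejected.
        response = conversation_agent.generate(prompt, feedback=feedback)
    return response
```

This sketch only captures the deployment-time behavior; the training-time component, including the Dynamic Improvement Reward that is based on how well the conversation agent incorporates the feedback, is described in the paper itself.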