The Alignment Waltz: Jointly Training Agents to Collaborate for Safety

Abstract

Harnessing the power of large language models (LLMs) requires a delicate dance between being helpful and being harmless. This creates a fundamental tension between two competing challenges: vulnerability to adversarial attacks that elicit unsafe content, and a tendency to overrefuse benign but sensitive prompts. Current approaches often navigate this dance with safeguard models that completely reject any content containing unsafe portions. This approach cuts the music entirely: it may exacerbate overrefusals, and it fails to provide nuanced guidance for the queries it refuses. To teach models a more coordinated choreography, we propose WaltzRL, a novel multi-agent reinforcement learning framework that formulates safety alignment as a collaborative, positive-sum game. WaltzRL jointly trains a conversation agent and a feedback agent, where the latter is incentivized to provide useful suggestions that improve the safety and helpfulness of the conversation agent's responses. At the core of WaltzRL is a Dynamic Improvement Reward (DIR) that evolves over time based on how well the conversation agent incorporates the feedback. At inference time, unsafe or overrefusing responses from the conversation agent are improved rather than discarded. The feedback agent is deployed together with the conversation agent and engages adaptively, only when needed, which preserves helpfulness and low latency on safe queries. Our experiments, conducted across five diverse datasets, demonstrate that WaltzRL significantly reduces both unsafe responses (e.g., from 39.0% to 4.6% on WildJailbreak) and overrefusals (from 45.3% to 9.9% on OR-Bench) compared to various baselines. By enabling the conversation and feedback agents to co-evolve and to apply feedback adaptively, WaltzRL enhances LLM safety without degrading general capabilities, thereby advancing the Pareto front between helpfulness and harmlessness.
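The inference-time behavior described in the abstract (draft a response, let the feedback agent step in only when needed, then revise rather than discard) can be illustrated with a minimal Python sketch. Everything named here is a hypothetical stand-in rather than the authors' implementation: the `Feedback` type, the `converse` and `critique` callables, and the `max_rounds` cap are assumptions made for illustration, and the abstract does not specify how the feedback agent decides to engage or how many rounds are allowed.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Feedback:
    # Hypothetical structure: the abstract does not define a feedback format.
    needed: bool    # True if the draft is judged unsafe or overrefusing
    text: str = ""  # free-form suggestion passed back to the conversation agent


def respond_with_feedback(
    prompt: str,
    converse: Callable[[str, Optional[str], Optional[str]], str],
    critique: Callable[[str, str], Feedback],
    max_rounds: int = 1,  # assumed cap on revision rounds
) -> Tuple[str, int]:
    """Return the final response and the number of feedback rounds used."""
    # First draft: no previous response and no feedback yet.
    draft = converse(prompt, None, None)
    rounds = 0
    while rounds < max_rounds:
        fb = critique(prompt, draft)
        if not fb.needed:
            # Adaptive engagement: an acceptable draft is returned unchanged,
            # so safe queries pay no revision cost.
            break
        # Improve the draft instead of discarding it.
        draft = converse(prompt, draft, fb.text)
        rounds += 1
    return draft, rounds


# Toy stand-ins for the two agents, for illustration only.
def toy_converse(prompt, previous, feedback):
    if feedback is None:
        return "I can't help with that."
    return "Here is some general, safe information."


def toy_critique(prompt, draft):
    if draft.startswith("I can't"):
        return Feedback(True, "The prompt is benign; answer it helpfully.")
    return Feedback(False)


final, used = respond_with_feedback(
    "How do I kill a Python process?", toy_converse, toy_critique
)
```

In this toy run the first draft is an overrefusal, so the stand-in feedback agent engages once and the conversation agent revises its answer. With a draft that is already acceptable, the loop exits immediately and no revision is made.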
