On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment

Abstract

Tool-using LLM agents can fail anywhere along a trajectory, not only in the final response: they may execute unsafe tool calls, follow injected instructions, comply with harmful requests, or over-refuse benign tasks, even when the final answer looks safe. Existing safety-alignment signals are largely response-level or off-policy and often incur a safety-utility trade-off, in which improving agent safety degrades task performance. Such sparse, single-objective rewards severely limit real-world usability. To bridge this gap, we propose FATE, an on-policy self-evolving framework that transforms verifier-scored failures into repair supervision without expert demonstrations. For each failure, the same policy proposes repair candidates, which verifiers then re-score and filter across security, utility, over-refusal control, and trajectory validity. The resulting dense, trajectory-level information serves as the supervision signal for agent self-evolution. Within this process, we further introduce Pareto-Front Policy Optimization (PFPO), which combines supervised warmup with Pareto-aware policy optimization to preserve the balance between safety and utility. Experiments on AgentDojo, AgentHarm, and ATBench show that FATE improves safety across different models and scales while preserving useful behavior. Compared with strong baselines, FATE reduces attack success rate by 33.5% and harmful compliance by 82.6%, and improves external trajectory-safety diagnosis by 6.5%. These results suggest that failed trajectories can provide structured repair supervision for safer self-evolving agents.
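As a rough illustration of the repair-filtering idea, the sketch below filters verifier-scored repair candidates on the four criteria named in the abstract and then keeps the candidates that are not Pareto-dominated on security and utility. It is a minimal sketch under stated assumptions, not the FATE or PFPO implementation: the `Candidate` fields, the score ranges, the thresholds, and every function name are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A repair candidate with hypothetical verifier scores in [0, 1]."""
    name: str
    security: float      # higher = safer trajectory (assumed scale)
    utility: float       # higher = benign task better completed (assumed scale)
    over_refusal: float  # lower = fewer unnecessary refusals (assumed scale)
    valid: bool          # trajectory is well-formed


def passes_filters(c, min_security=0.5, min_utility=0.5, max_over_refusal=0.5):
    """Apply the four filters; the threshold values are illustrative assumptions."""
    return (
        c.valid
        and c.security >= min_security
        and c.utility >= min_utility
        and c.over_refusal <= max_over_refusal
    )


def dominates(a, b):
    """True if a is at least as good as b on both objectives and better on one."""
    no_worse = a.security >= b.security and a.utility >= b.utility
    strictly_better = a.security > b.security or a.utility > b.utility
    return no_worse and strictly_better


def pareto_front(candidates):
    """Keep candidates that no other candidate dominates (input order preserved)."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


def select_repairs(candidates):
    """Filter first, then keep the safety-utility Pareto front of the survivors."""
    kept = [c for c in candidates if passes_filters(c)]
    return pareto_front(kept)


candidates = [
    Candidate("refuse_everything", 1.0, 0.0, 1.0, True),   # safe but useless
    Candidate("safe_and_helpful", 0.9, 0.8, 0.1, True),
    Candidate("helpful_less_safe", 0.7, 0.9, 0.0, True),
    Candidate("dominated", 0.6, 0.7, 0.2, True),            # worse on both axes
    Candidate("malformed", 0.95, 0.95, 0.0, False),         # invalid trajectory
]

selected = select_repairs(candidates)
print([c.name for c in selected])
# prints ['safe_and_helpful', 'helpful_less_safe']
```

In this toy setting, a candidate that refuses everything is rejected by the utility and over-refusal filters even though it scores perfectly on security, which is one way a multi-criteria filter can avoid rewarding safety gained purely through refusal.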
