
On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment

May 12, 2026
作者: Bo Yin, Qi Li, Xinchao Wang
cs.AI

Abstract

Tool-using LLM agents fail through trajectories rather than only final responses: they may execute unsafe tool calls, follow injected instructions, comply with harmful requests, or over-refuse benign tasks despite producing a seemingly safe answer. Existing safety-alignment signals are largely response-level or off-policy, and often incur a safety-utility trade-off: improving agent safety comes at the cost of degraded task performance. Such sparse and single-objective rewards severely limit real-world usability. To bridge this gap, we propose FATE, an on-policy self-evolving framework that transforms verifier-scored failures into repair supervision without expert demonstrations. For each failure, the same policy proposes repair candidates, which are then re-scored by verifiers and filtered across security, utility, over-refusal control, and trajectory validity. This dense trajectory-level information is then used as a supervision signal for agent self-evolution. During this process, we further introduce Pareto-Front Policy Optimization (PFPO), combining supervised warmup with Pareto-aware policy optimization to balance safety and utility. Experiments on AgentDojo, AgentHarm, and ATBench show that FATE improves safety across different models and scales while preserving useful behavior. Compared with strong baselines, FATE reduces attack success rate by 33.5%, harmful compliance by 82.6%, and improves external trajectory-safety diagnosis by 6.5%. These results suggest that failed trajectories can provide structured repair supervision for safer self-evolving agents.
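The abstract does not give the filtering mechanics, but the described step — score repair candidates with verifiers, drop invalid or over-refusing ones, and keep candidates that balance safety and utility — can be sketched as a simple Pareto filter. Everything below is illustrative: the `Repair` fields, the `dominates` rule, and the `pareto_front` helper are assumed names, not the paper's actual interface.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Repair:
    """A candidate repair trajectory with verifier scores (hypothetical schema)."""
    trajectory: str
    safety: float   # verifier score: resistance to unsafe tool calls / injections
    utility: float  # verifier score: task completion quality
    valid: bool     # trajectory-validity check (e.g. well-formed tool calls)
    refused: bool   # whether the candidate over-refuses a benign task

def dominates(a: Repair, b: Repair) -> bool:
    """a Pareto-dominates b: no worse on both objectives, strictly better on one."""
    return (a.safety >= b.safety and a.utility >= b.utility
            and (a.safety > b.safety or a.utility > b.utility))

def pareto_front(candidates: List[Repair]) -> List[Repair]:
    """Drop invalid/over-refusing candidates, then keep the safety-utility front."""
    pool = [c for c in candidates if c.valid and not c.refused]
    return [c for c in pool
            if not any(dominates(o, c) for o in pool if o is not c)]
```

In this reading, the surviving front — trajectories that cannot be improved on safety without losing utility, or vice versa — would supply the dense trajectory-level supervision for the self-evolution step; the abstract itself does not specify how ties or empty fronts are handled.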