ChatPaper.ai


On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment

May 12, 2026
Authors: Bo Yin, Qi Li, Xinchao Wang
cs.AI

Abstract

Tool-using LLM agents fail through their trajectories, not only their final responses: they may execute unsafe tool calls, follow injected instructions, comply with harmful requests, or over-refuse benign tasks even while producing a seemingly safe answer. Existing safety-alignment signals are largely response-level or off-policy, and often incur a safety-utility trade-off: improving agent safety comes at the cost of degraded task performance. Such sparse, single-objective rewards severely limit real-world usability. To bridge this gap, we propose FATE, an on-policy self-evolving framework that transforms verifier-scored failures into repair supervision without expert demonstrations. For each failure, the same policy proposes repair candidates, which are then re-scored by verifiers and filtered across security, utility, over-refusal control, and trajectory validity. This dense trajectory-level information then serves as the supervision signal for agent self-evolution. During this process, we further introduce Pareto-Front Policy Optimization (PFPO), which combines supervised warmup with Pareto-aware policy optimization to balance safety and utility. Experiments on AgentDojo, AgentHarm, and ATBench show that FATE improves safety across different models and scales while preserving useful behavior. Compared with strong baselines, FATE reduces attack success rate by 33.5%, reduces harmful compliance by 82.6%, and improves external trajectory-safety diagnosis by 6.5%. These results suggest that failed trajectories can provide structured repair supervision for safer self-evolving agents.
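The filtering step the abstract describes, in which repair candidates are scored on four objectives (security, utility, over-refusal control, trajectory validity) and only non-dominated ones are kept as supervision targets, can be sketched as a simple Pareto-dominance filter. This is an illustrative sketch, not the paper's actual implementation; the `Candidate` type, score tuple, and example trajectories are all hypothetical, and all objectives are assumed higher-is-better.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical representation: a repair candidate and its verifier scores
# (security, utility, over-refusal control, trajectory validity),
# each assumed higher-is-better. Names are illustrative only.
@dataclass
class Candidate:
    trajectory: str
    scores: Tuple[float, float, float, float]

def dominates(a: Tuple[float, ...], b: Tuple[float, ...]) -> bool:
    """a dominates b if a >= b on every objective and > on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(cands: List[Candidate]) -> List[Candidate]:
    """Keep only non-dominated candidates as repair supervision targets."""
    return [c for c in cands
            if not any(dominates(o.scores, c.scores) for o in cands if o is not c)]

cands = [
    Candidate("refuse harmful subtask, complete benign part", (0.9, 0.8, 0.9, 1.0)),
    Candidate("comply with injected instruction",             (0.1, 0.9, 0.8, 1.0)),
    Candidate("over-refuse the entire task",                  (0.95, 0.1, 0.2, 1.0)),
    Candidate("strictly worse repair",                        (0.8, 0.7, 0.8, 0.9)),
]
front = pareto_front(cands)  # the strictly dominated repair is filtered out
```

The point of the Pareto filter, as opposed to a single scalar reward, is that a repair maximizing safety alone (e.g. blanket refusal) does not dominate one that preserves utility, so both perspectives survive into the supervision set.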