On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment

Abstract

Tool-using LLM agents can fail anywhere along a trajectory, not only in the final response: even when the final answer looks safe, an agent may have executed unsafe tool calls, followed injected instructions, complied with harmful requests, or over-refused benign tasks. Existing safety-alignment signals are largely response-level or off-policy, and they often incur a safety-utility trade-off in which improved safety comes at the cost of degraded task performance. Such sparse, single-objective rewards severely limit real-world usability. To bridge this gap, we propose FATE, an on-policy self-evolving framework that transforms verifier-scored failures into repair supervision without expert demonstrations. For each failure, the same policy proposes repair candidates, which are then re-scored by verifiers and filtered on security, utility, over-refusal control, and trajectory validity. The resulting dense, trajectory-level information serves as the supervision signal for agent self-evolution. Within this process, we further introduce Pareto-Front Policy Optimization (PFPO), which combines supervised warmup with Pareto-aware policy optimization to preserve the safety-utility trade-off. Experiments on AgentDojo, AgentHarm, and ATBench show that FATE improves safety across different models and scales while preserving useful behavior. Compared with strong baselines, FATE reduces attack success rate by 33.5% and harmful compliance by 82.6%, and improves external trajectory-safety diagnosis by 6.5%. These results suggest that failed trajectories can provide structured repair supervision for safer self-evolving agents.
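
The filtering step described above can be illustrated with a minimal sketch. The abstract does not specify how candidates are scored or selected, so everything below is an assumption made for illustration: the `RepairCandidate` type, its score fields (higher is better), the treatment of trajectory validity as a hard constraint, and the plain Pareto-dominance rule. It is not FATE's actual filter or the PFPO objective.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepairCandidate:
    """Hypothetical verifier scores for one repaired trajectory (higher is better)."""
    name: str
    security: float
    utility: float
    refusal_control: float
    valid: bool  # trajectory validity, assumed here to be a hard constraint


def scores(c):
    """Collect the three soft objectives into one tuple."""
    return (c.security, c.utility, c.refusal_control)


def dominates(a, b):
    """True if a is at least as good as b on every axis and strictly better on one."""
    sa, sb = scores(a), scores(b)
    return all(x >= y for x, y in zip(sa, sb)) and any(x > y for x, y in zip(sa, sb))


def pareto_front(candidates):
    """Drop invalid trajectories, then keep only non-dominated candidates."""
    valid = [c for c in candidates if c.valid]
    return [c for c in valid if not any(dominates(o, c) for o in valid)]


# Made-up scores for five repair candidates proposed for one failed trajectory.
candidates = [
    RepairCandidate("refuse_everything", 1.0, 0.0, 0.0, True),
    RepairCandidate("safe_and_helpful", 0.9, 0.8, 0.9, True),
    RepairCandidate("helpful_but_leaky", 0.3, 0.9, 0.9, True),
    RepairCandidate("dominated", 0.8, 0.7, 0.9, True),
    RepairCandidate("malformed_call", 1.0, 1.0, 1.0, False),
]
front_names = [c.name for c in pareto_front(candidates)]
print(front_names)
```

In this toy example the invalid trajectory is discarded despite its high scores, and the candidate that is beaten on every axis is dropped. The remaining front still contains extreme candidates, such as one that refuses everything, so a practical filter would presumably also apply minimum thresholds per objective; that too is an assumption rather than something the abstract states.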
