BraveGuard: 개방형 세계 위협에서 더 안전한 컴퓨터 사용 에이전트로

초록

컴퓨터 사용 에이전트는 언어 모델을 텍스트 생성에서 파일, 터미널, 브라우저, 외부 도구와의 지속적인 상호작용으로 확장한다. 이러한 변화는 개별 프롬프트나 최종 응답만으로는 탐지하기 어려운 안전 위험을 야기하는데, 피해는 종종 각 개별 행동이 국지적으로 무해해 보이는 다단계 실행 궤적을 통해서만 드러나기 때문이다. 우리는 오픈월드 위협 신호와 현실적인 에이전트 궤적으로부터 가드 모델을 훈련시키기 위한 자기 진화형 방어 프레임워크인 BraveGuard를 소개한다. BraveGuard는 최신 연구 자료를 분석하여 신흥 위험과 공격 패턴을 식별하고, 이를 실행 가능한 컴퓨터 사용 과제로 구체화하며, 에이전트 롤아웃을 수집하고, 가드 모델 훈련을 위한 궤적 수준의 감독 신호를 도출한다. 새로운 위협과 검증 실패가 나타나면 파이프라인을 반복할 수 있어, 고정된 벤치마크 기반 훈련 과정이 아닌 적응형 방어 루프를 생성한다. 우리는 Qwen3-Guard 및 Llama-Guard 변형을 포함한 여러 가드 백본을 훈련시켜 BraveGuard를 구현하고, 결과 가드 모델을 궤적 수준의 에이전트 안전 벤치마크에서 평가한다. BraveGuard는 컴퓨터 사용 궤적 전반에 걸쳐 안전 탐지를 일관되게 개선한다. AgentHazard에서는 기성 가드 모델 대비 탐지 정확도가 크게 향상되어, 평균 가드 모델 설정에서 정확도가 38.79%에서 82.38%로 증가한다. 이러한 결과는 오픈월드 위협 발견과 현실적인 에이전트 실행에 기반한 가드 감독이 고정된 분류 체계나 합성 프롬프트 수준 데이터를 넘어 안전 모니터링을 개선할 수 있음을 보여준다. BraveGuard는 진화하는 현실 세계 위험에 직면한 컴퓨터 사용 에이전트를 위한 적응형 방어로 가는 확장 가능한 경로를 제공한다.

English

Computer-use agents extend language models from text generation to sustained interaction with files, terminals, browsers, and external tools. This shift creates safety risks that are difficult to detect from isolated prompts or final responses, because harm often emerges only through multi-step execution traces whose individual actions appear locally benign. We introduce BraveGuard, a self-evolving defense framework for training guard models from open-world threat signals and realistic agent trajectories. BraveGuard mines recent research sources to identify emerging risks and attack patterns, instantiates them as executable computer-use tasks, collects agent rollouts, and derives trajectory-level supervision for guard model training. As new threats and validation failures appear, the pipeline can be repeated, yielding an adaptive defense loop rather than a static, benchmark-driven training process. We instantiate BraveGuard by training multiple guard backbones, including Qwen3-Guard and Llama-Guard variants, and evaluate the resulting guards on trajectory-level agent-safety benchmarks. BraveGuard consistently improves safety detection across computer-use trajectories. On AgentHazard, it substantially improves detection accuracy over off-the-shelf guard models, with accuracy increasing from 38.79% to 82.38% under the averaged guard-model setting. These results show that guard supervision grounded in open-world threat discovery and realistic agent execution can improve safety monitoring beyond fixed taxonomies and synthetic prompt-level data. BraveGuard offers a scalable path toward adaptive defenses for computer-use agents facing evolving real-world risks.