BraveGuard：从开放世界威胁到更安全的计算机使用代理

摘要

计算机使用代理将语言模型从文本生成扩展到与文件、终端、浏览器和外部工具的持续交互。这一转变带来了安全风险，这些风险难以通过孤立的提示或最终响应来检测，因为危害往往仅通过多步执行轨迹显现，而其中的单个动作在局部看似无害。我们引入了BraveGuard，这是一种自我进化的防御框架，用于从开放世界的威胁信号和真实的代理轨迹中训练防护模型。BraveGuard挖掘近期研究来源以识别新兴风险和攻击模式，将其实例化为可执行的计算机使用任务，收集代理的推演结果，并推导出轨迹级监督信号用于防护模型训练。随着新威胁和验证失败的出现，该流程可重复执行，从而形成一个自适应的防御循环，而非静态的、基准驱动的训练过程。我们通过训练多种防护骨干模型（包括Qwen3-Guard和Llama-Guard变体）来实例化BraveGuard，并在轨迹级代理安全基准上评估由此产生的防护模型。BraveGuard在计算机使用轨迹上持续改进安全检测性能。在AgentHazard基准上，相比现成的防护模型，其检测精度大幅提升，在平均防护模型设置下准确率从38.79%提升至82.38%。这些结果表明，基于开放世界威胁发现和真实代理执行的防护监督能够超越固定的分类体系和合成提示级数据，改进安全监控。BraveGuard为面临不断演变的现实世界风险的计算机使用代理提供了一条可扩展的自适应防御路径。

English

Computer-use agents extend language models from text generation to sustained interaction with files, terminals, browsers, and external tools. This shift creates safety risks that are difficult to detect from isolated prompts or final responses, because harm often emerges only through multi-step execution traces whose individual actions appear locally benign. We introduce BraveGuard, a self-evolving defense framework for training guard models from open-world threat signals and realistic agent trajectories. BraveGuard mines recent research sources to identify emerging risks and attack patterns, instantiates them as executable computer-use tasks, collects agent rollouts, and derives trajectory-level supervision for guard model training. As new threats and validation failures appear, the pipeline can be repeated, yielding an adaptive defense loop rather than a static, benchmark-driven training process. We instantiate BraveGuard by training multiple guard backbones, including Qwen3-Guard and Llama-Guard variants, and evaluate the resulting guards on trajectory-level agent-safety benchmarks. BraveGuard consistently improves safety detection across computer-use trajectories. On AgentHazard, it substantially improves detection accuracy over off-the-shelf guard models, with accuracy increasing from 38.79% to 82.38% under the averaged guard-model setting. These results show that guard supervision grounded in open-world threat discovery and realistic agent execution can improve safety monitoring beyond fixed taxonomies and synthetic prompt-level data. BraveGuard offers a scalable path toward adaptive defenses for computer-use agents facing evolving real-world risks.