BraveGuard: 從開放世界威脅到更安全的電腦使用代理

摘要

電腦使用型智能體將語言模型的應用範疇，從單純的文字生成擴展至與檔案、終端機、瀏覽器及外部工具進行持續互動。此一轉變帶來了難以從單一提示或最終回應中察覺的安全風險，原因是危害往往僅在需多步驟執行的軌跡中浮現，而其中個別動作看似於局部無害。我們提出 BraveGuard，這是一個自我演化防禦框架，旨在從開放世界的威脅訊號與真實的智能體軌跡中訓練守護模型。BraveGuard 會挖掘近期研究資料，以識別新興風險與攻擊模式，將其具體化為可執行的電腦使用任務，收集智能體的軌跡展開結果，並導出軌跡層級的監督訊號，用以訓練守護模型。當出現新的威脅或驗證失敗時，此流程可重複執行，形成適應性的防禦循環，而非靜態、以基準測試為主的訓練過程。我們透過訓練多種守護模型主幹（包括 Qwen3-Guard 與 Llama-Guard 變體）來具體化 BraveGuard，並在軌跡層級的智能體安全基準測試中評估這些守護模型。BraveGuard 能持續提升電腦使用軌跡之安全性偵測表現。在 AgentHazard 基準測試中，相較於現成的守護模型，其偵測準確率大幅提升；在平均守護模型設定下，準確率從 38.79% 提升至 82.38%。這些結果顯示，立基於開放世界威脅發現與真實智能體執行的守護監督，能夠超越固定的分類架構與合成提示層級資料，改善安全監控。BraveGuard 為面臨不斷演進之真實世界風險的電腦使用型智能體，提供了一條可擴展的適應性防禦路徑。

English

Computer-use agents extend language models from text generation to sustained interaction with files, terminals, browsers, and external tools. This shift creates safety risks that are difficult to detect from isolated prompts or final responses, because harm often emerges only through multi-step execution traces whose individual actions appear locally benign. We introduce BraveGuard, a self-evolving defense framework for training guard models from open-world threat signals and realistic agent trajectories. BraveGuard mines recent research sources to identify emerging risks and attack patterns, instantiates them as executable computer-use tasks, collects agent rollouts, and derives trajectory-level supervision for guard model training. As new threats and validation failures appear, the pipeline can be repeated, yielding an adaptive defense loop rather than a static, benchmark-driven training process. We instantiate BraveGuard by training multiple guard backbones, including Qwen3-Guard and Llama-Guard variants, and evaluate the resulting guards on trajectory-level agent-safety benchmarks. BraveGuard consistently improves safety detection across computer-use trajectories. On AgentHazard, it substantially improves detection accuracy over off-the-shelf guard models, with accuracy increasing from 38.79% to 82.38% under the averaged guard-model setting. These results show that guard supervision grounded in open-world threat discovery and realistic agent execution can improve safety monitoring beyond fixed taxonomies and synthetic prompt-level data. BraveGuard offers a scalable path toward adaptive defenses for computer-use agents facing evolving real-world risks.