BraveGuard: オープンワールドの脅威からより安全なコンピュータ操作エージェントへ

要旨

コンピュータ利用エージェントは、言語モデルをテキスト生成からファイル、端末、ブラウザ、外部ツールとの持続的なインタラクションへと拡張する。この移行により、個々の動作は局所的には無害に見えるものの、害が多段階の実行痕跡を通じて初めて顕在化するため、単独のプロンプトや最終応答からは検出が困難な安全性リスクが生じる。本稿では、オープンワールドの脅威シグナルと現実的なエージェント軌跡からガードモデルを訓練する自己進化的防御フレームワーク「BraveGuard」を提案する。BraveGuardは最新の研究ソースから新興リスクや攻撃パターンを特定し、それらを実行可能なコンピュータ利用タスクとして具体化し、エージェントのロールアウトを収集し、軌跡レベルの監視信号を導出してガードモデルの訓練に活用する。新たな脅威や検証の失敗が出現するたびにパイプラインを反復可能であり、静的でベンチマーク駆動型の訓練プロセスではなく、適応的な防御ループを実現する。本稿では、Qwen3-GuardやLlama-Guardの派生モデルを含む複数のガードバックボーンを訓練し、得られたガードモデルを軌跡レベルのエージェント安全性ベンチマークで評価する。BraveGuardはコンピュータ利用軌跡全体にわたって安全性検出を一貫して改善する。AgentHazardにおいては、既製のガードモデルと比較して検出精度が大幅に向上し、平均化ガードモデル設定では38.79%から82.38%に精度が上昇した。これらの結果は、オープンワールドの脅威発見と現実的なエージェント実行に基づくガード監視が、固定された分類体系や合成プロンプトレベルのデータを超えて安全性監視を改善できることを示している。BraveGuardは、進化する実世界リスクに直面するコンピュータ利用エージェントに対する適応的防御へのスケーラブルな道筋を提供する。

English

Computer-use agents extend language models from text generation to sustained interaction with files, terminals, browsers, and external tools. This shift creates safety risks that are difficult to detect from isolated prompts or final responses, because harm often emerges only through multi-step execution traces whose individual actions appear locally benign. We introduce BraveGuard, a self-evolving defense framework for training guard models from open-world threat signals and realistic agent trajectories. BraveGuard mines recent research sources to identify emerging risks and attack patterns, instantiates them as executable computer-use tasks, collects agent rollouts, and derives trajectory-level supervision for guard model training. As new threats and validation failures appear, the pipeline can be repeated, yielding an adaptive defense loop rather than a static, benchmark-driven training process. We instantiate BraveGuard by training multiple guard backbones, including Qwen3-Guard and Llama-Guard variants, and evaluate the resulting guards on trajectory-level agent-safety benchmarks. BraveGuard consistently improves safety detection across computer-use trajectories. On AgentHazard, it substantially improves detection accuracy over off-the-shelf guard models, with accuracy increasing from 38.79% to 82.38% under the averaged guard-model setting. These results show that guard supervision grounded in open-world threat discovery and realistic agent execution can improve safety monitoring beyond fixed taxonomies and synthetic prompt-level data. BraveGuard offers a scalable path toward adaptive defenses for computer-use agents facing evolving real-world risks.