AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents

Abstract

Computer-use agents extend language models from text generation to persistent action over tools, files, and execution environments. Unlike chat systems, they maintain state across interactions and translate intermediate outputs into concrete actions. This creates a distinct safety challenge: harmful behavior may emerge through a sequence of individually plausible steps, including intermediate actions that appear locally acceptable but collectively lead to unauthorized actions. We present AgentHazard, a benchmark for evaluating harmful behavior in computer-use agents. AgentHazard contains 2,653 instances spanning diverse risk categories and attack strategies. Each instance pairs a harmful objective with a sequence of operational steps that are locally legitimate but jointly induce unsafe behavior. The benchmark evaluates whether agents can recognize and interrupt harm arising from accumulated context, repeated tool use, intermediate actions, and dependencies across steps. We evaluate AgentHazard on Claude Code, OpenClaw, and IFlow using mostly open or openly deployable models from the Qwen3, Kimi, GLM, and DeepSeek families. Our experimental results indicate that current systems remain highly vulnerable. In particular, when powered by Qwen3-Coder, Claude Code exhibits an attack success rate of 73.63%, suggesting that model alignment alone does not reliably guarantee the safety of autonomous agents.
