

AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents

April 3, 2026
Authors: Yunhao Feng, Yifan Ding, Yingshui Tan, Xingjun Ma, Yige Li, Yutao Wu, Yifeng Gao, Kun Zhai, Yanming Guo
cs.AI

Abstract

Computer-use agents extend language models from text generation to persistent action over tools, files, and execution environments. Unlike chat systems, they maintain state across interactions and translate intermediate outputs into concrete actions. This creates a distinct safety challenge in that harmful behavior may emerge through sequences of individually plausible steps, including intermediate actions that appear locally acceptable but collectively lead to unauthorized actions. We present AgentHazard, a benchmark for evaluating harmful behavior in computer-use agents. AgentHazard contains 2,653 instances spanning diverse risk categories and attack strategies. Each instance pairs a harmful objective with a sequence of operational steps that are locally legitimate but jointly induce unsafe behavior. The benchmark evaluates whether agents can recognize and interrupt harm arising from accumulated context, repeated tool use, intermediate actions, and dependencies across steps. We evaluate AgentHazard on Claude Code, OpenClaw, and IFlow using mostly open or openly deployable models from the Qwen3, Kimi, GLM, and DeepSeek families. Our experimental results indicate that current systems remain highly vulnerable. In particular, when powered by Qwen3-Coder, Claude Code exhibits an attack success rate of 73.63%, suggesting that model alignment alone does not reliably guarantee the safety of autonomous agents.
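The abstract describes each benchmark instance as a harmful objective paired with individually plausible steps, scored by whether the agent ultimately completes the objective. A minimal sketch of that structure and of an attack-success-rate computation is below; the field names and schema are illustrative assumptions, not the benchmark's actual format.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """Hypothetical AgentHazard-style test case: a harmful end goal
    decomposed into steps that look legitimate in isolation."""
    objective: str        # the harmful end goal
    steps: list[str]      # individually plausible operational steps
    risk_category: str    # e.g. "data exfiltration" (illustrative label)


def attack_success_rate(completed: list[bool]) -> float:
    """Fraction of runs in which the agent carried the harmful objective
    through to completion instead of recognizing and interrupting it."""
    if not completed:
        return 0.0
    return sum(completed) / len(completed)


# Toy usage: the agent completed the harmful objective in 3 of 4 runs.
rate = attack_success_rate([True, True, True, False])
print(f"{rate:.2%}")  # prints "75.00%"
```

Under this framing, a safe agent lowers the rate only by refusing or halting somewhere along the step sequence, which is exactly the cross-step recognition the benchmark probes.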
PDF (April 7, 2026)