AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents
April 3, 2026
Authors: Yunhao Feng, Yifan Ding, Yingshui Tan, Xingjun Ma, Yige Li, Yutao Wu, Yifeng Gao, Kun Zhai, Yanming Guo
cs.AI
Abstract
Computer-use agents extend language models from text generation to persistent action over tools, files, and execution environments. Unlike chat systems, they maintain state across interactions and translate intermediate outputs into concrete actions. This creates a distinct safety challenge: harmful behavior may emerge through sequences of individually plausible steps, including intermediate actions that appear locally acceptable but collectively lead to unauthorized operations. We present AgentHazard, a benchmark for evaluating harmful behavior in computer-use agents. AgentHazard contains 2,653 test instances spanning diverse risk categories and attack strategies. Each instance pairs a harmful objective with a sequence of operational steps that are locally legitimate but jointly induce unsafe behavior. The benchmark evaluates whether agents can recognize and interrupt harm arising from accumulated context, repeated tool use, intermediate actions, and dependencies across steps. We evaluate Claude Code, OpenClaw, and IFlow on AgentHazard, using mostly open or openly deployable models from the Qwen3, Kimi, GLM, and DeepSeek families. Our results indicate that current systems remain highly vulnerable: when powered by Qwen3-Coder, Claude Code exhibits an attack success rate of 73.63%, suggesting that model alignment alone does not reliably guarantee the safety of autonomous agents.
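As a rough illustration only (the paper's actual data format and judging pipeline are not specified here), a benchmark instance of the kind the abstract describes could be modeled as a harmful objective paired with an ordered list of locally plausible steps, with the attack success rate (ASR) computed as the percentage of instances on which the agent completes the harmful objective. All names below are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class HazardInstance:
    """Hypothetical schema for one test instance: a harmful end goal
    reached via steps that each look acceptable in isolation."""
    harmful_objective: str                           # the unsafe end goal
    steps: list[str] = field(default_factory=list)   # locally legitimate operations
    risk_category: str = ""                          # e.g. "data exfiltration"

def attack_success_rate(outcomes: list[bool]) -> float:
    """ASR as a percentage: the fraction of instances on which the agent
    carried the harmful objective through to completion."""
    if not outcomes:
        return 0.0
    return 100.0 * sum(outcomes) / len(outcomes)

# Example: the agent completed 3 of 4 harmful objectives -> ASR of 75.0%
print(attack_success_rate([True, True, True, False]))
```

A multi-step design like this is what distinguishes agent benchmarks from single-turn refusal tests: the evaluation must track whether harm accumulates across `steps`, not just whether any single step would be refused in isolation.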