AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents

Abstract

Computer-use agents extend language models from text generation to persistent action over tools, files, and execution environments. Unlike chat systems, they maintain state across interactions and translate intermediate outputs into concrete actions. This creates a distinct safety challenge: harmful behavior may emerge through sequences of individually plausible steps, including intermediate actions that appear locally acceptable but collectively lead to unauthorized actions. We present AgentHazard, a benchmark for evaluating harmful behavior in computer-use agents. AgentHazard contains 2,653 instances spanning diverse risk categories and attack strategies. Each instance pairs a harmful objective with a sequence of operational steps that are locally legitimate but jointly induce unsafe behavior. The benchmark evaluates whether agents can recognize and interrupt harm arising from accumulated context, repeated tool use, intermediate actions, and dependencies across steps. We evaluate AgentHazard on Claude Code, OpenClaw, and IFlow using mostly open or openly deployable models from the Qwen3, Kimi, GLM, and DeepSeek families. Our experimental results indicate that current systems remain highly vulnerable. In particular, when powered by Qwen3-Coder, Claude Code exhibits an attack success rate of 73.63%, suggesting that model alignment alone does not reliably guarantee the safety of autonomous agents.
