
The Blind Spot of Agent Safety: How Benign User Instructions Expose Critical Vulnerabilities in Computer-Use Agents

April 12, 2026
Authors: Xuwei Ding, Skylar Zhai, Linxin Song, Jiate Li, Taiwei Shi, Nicholas Meade, Siva Reddy, Jian Kang, Jieyu Zhao
cs.AI

Abstract

Computer-use agents (CUAs) can now autonomously complete complex tasks in real digital environments, but when misled, they can also be used to automate harmful actions at scale. Existing safety evaluations largely target explicit threats such as misuse and prompt injection, overlooking a subtle yet critical setting in which user instructions are entirely benign and harm arises from the task context or execution outcome. We introduce OS-BLIND, a benchmark that evaluates CUAs under unintended attack conditions, comprising 300 human-crafted tasks across 12 categories, 8 applications, and 2 threat clusters: environment-embedded threats and agent-initiated harms. Our evaluation of frontier models and agentic frameworks reveals that most CUAs exceed a 90% attack success rate (ASR), and even the safety-aligned Claude 4.5 Sonnet reaches 73.0% ASR. Strikingly, this vulnerability becomes even more severe when Claude 4.5 Sonnet is deployed in multi-agent systems, with ASR rising from 73.0% to 92.7%. Our analysis further shows that existing safety defenses provide limited protection when user instructions are benign: safety alignment primarily activates within the first few steps and rarely re-engages during subsequent execution, and in multi-agent systems, decomposed subtasks obscure the harmful intent from the model, causing safety-aligned models to fail. We will release OS-BLIND to encourage the broader research community to further investigate and address these safety challenges.
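As a minimal sketch of the reported metric: ASR is the fraction of benchmark tasks on which the agent ends up completing the harmful outcome rather than refusing or failing safely. The helper below is illustrative only (the paper's actual judging pipeline is not described in the abstract); the 219/300 split is a hypothetical example chosen to reproduce a 73.0% ASR over the 300 OS-BLIND tasks.

```python
def attack_success_rate(outcomes):
    """Compute ASR as a percentage.

    outcomes: list of booleans, one per task; True means the agent
    completed the harmful task (attack succeeded), False means it
    refused or otherwise failed safely.
    """
    if not outcomes:
        return 0.0
    return 100.0 * sum(outcomes) / len(outcomes)

# Hypothetical split: 219 of 300 tasks end in harm -> 73.0% ASR
outcomes = [True] * 219 + [False] * 81
print(f"ASR: {attack_success_rate(outcomes):.1f}%")  # ASR: 73.0%
```

Under this definition, the multi-agent result corresponds to roughly 278 of 300 tasks (92.7%) ending in the harmful outcome.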