The Blind Spot of Agent Safety: How Benign User Instructions Expose Critical Vulnerabilities in Computer-Use Agents

Abstract

Computer-use agents (CUAs) can now autonomously complete complex tasks in real digital environments, but when misled, they can also be used to automate harmful actions programmatically. Existing safety evaluations largely target explicit threats such as misuse and prompt injection, but overlook a subtle yet critical setting in which user instructions are entirely benign and harm arises from the task context or execution outcome. We introduce OS-BLIND, a benchmark that evaluates CUAs under unintended attack conditions, comprising 300 human-crafted tasks across 12 categories, 8 applications, and 2 threat clusters: environment-embedded threats and agent-initiated harms. Our evaluation of frontier models and agentic frameworks reveals that most CUAs exceed a 90% attack success rate (ASR), and even the safety-aligned Claude 4.5 Sonnet reaches 73.0% ASR. This vulnerability becomes more severe when Claude 4.5 Sonnet is deployed in multi-agent systems, where ASR rises from 73.0% to 92.7%. Our analysis further shows that existing safety defenses provide limited protection when user instructions are benign: safety alignment primarily activates within the first few steps and rarely re-engages during subsequent execution. In multi-agent systems, decomposed subtasks obscure the harmful intent from the model, causing safety-aligned models to fail. We will release OS-BLIND to encourage the broader research community to further investigate and address these safety challenges.
