The Blind Spot of Agent Safety: How Benign User Instructions Expose Critical Vulnerabilities in Computer-Use Agents

Abstract

Computer-use agents (CUAs) can now autonomously complete complex tasks in real digital environments, but when misled, they can also be used to automate harmful actions programmatically. Existing safety evaluations largely target explicit threats such as misuse and prompt injection, but overlook a subtle yet critical setting in which user instructions are entirely benign and harm arises from the task context or execution outcome. We introduce OS-BLIND, a benchmark that evaluates CUAs under unintended attack conditions. It comprises 300 human-crafted tasks across 12 categories, 8 applications, and 2 threat clusters: environment-embedded threats and agent-initiated harms. Our evaluation of frontier models and agentic frameworks reveals that most CUAs exceed a 90% attack success rate (ASR), and even the safety-aligned Claude 4.5 Sonnet reaches 73.0% ASR. The vulnerability becomes more severe when Claude 4.5 Sonnet is deployed in multi-agent systems, where ASR rises from 73.0% to 92.7%. Our analysis further shows that existing safety defenses provide limited protection when user instructions are benign. Safety alignment primarily activates within the first few steps and rarely re-engages during subsequent execution. In multi-agent systems, decomposed subtasks obscure the harmful intent from the model, causing safety-aligned models to fail. We will release OS-BLIND to encourage the broader research community to further investigate and address these safety challenges.
