Code Agents Can Be End-to-End System Hackers: Benchmarking Real-World Threats of Computer-Use Agents

Abstract

Computer-use agent (CUA) frameworks, powered by large language models (LLMs) or multimodal LLMs (MLLMs), are rapidly maturing as assistants that can perceive context, reason, and act directly within software environments. Among their most critical applications is operating system (OS) control. As CUAs in the OS domain become increasingly embedded in daily operations, it is imperative to examine their real-world security implications, specifically whether CUAs can be misused to perform realistic, security-relevant attacks. Existing work has four major limitations: a missing attacker-knowledge model of tactics, techniques, and procedures (TTPs); incomplete coverage of end-to-end kill chains; unrealistic environments that lack multiple hosts and encrypted user credentials; and unreliable judgment that depends on LLM-as-a-Judge. To address these gaps, we propose AdvCUA, the first benchmark aligned with real-world TTPs in the MITRE ATT&CK Enterprise Matrix. AdvCUA comprises 140 tasks (40 direct malicious tasks, 74 TTP-based malicious tasks, and 26 end-to-end kill chains) and systematically evaluates CUAs under realistic enterprise OS security threats in a multi-host sandbox environment using hard-coded evaluation. We evaluate five mainstream CUAs (ReAct, AutoGPT, Gemini CLI, Cursor CLI, and Cursor IDE) built on 8 foundation LLMs. The results demonstrate that current frontier CUAs do not adequately cover OS security-centric threats. These capabilities of CUAs reduce dependence on custom malware and deep domain expertise, enabling even inexperienced attackers to mount complex enterprise intrusions, which raises societal concern about the responsibility and security of CUAs.