Code Agent can be an End-to-end System Hacker: Benchmarking Real-world Threats of Computer-use Agent

October 8, 2025
Authors: Weidi Luo, Qiming Zhang, Tianyu Lu, Xiaogeng Liu, Bin Hu, Hung-Chun Chiu, Siyuan Ma, Yizhe Zhang, Xusheng Xiao, Yinzhi Cao, Zhen Xiang, Chaowei Xiao
cs.AI

Abstract

Computer-use agent (CUA) frameworks, powered by large language models (LLMs) or multimodal LLMs (MLLMs), are rapidly maturing as assistants that can perceive context, reason, and act directly within software environments. Among their most critical applications is operating system (OS) control. As CUAs in the OS domain become increasingly embedded in daily operations, it is imperative to examine their real-world security implications, specifically whether CUAs can be misused to perform realistic, security-relevant attacks. Existing works exhibit four major limitations: a missing attacker-knowledge model of tactics, techniques, and procedures (TTPs); incomplete coverage of end-to-end kill chains; unrealistic environments without multiple hosts or encrypted user credentials; and unreliable judgment that depends on LLM-as-a-Judge. To address these gaps, we propose AdvCUA, the first benchmark aligned with real-world TTPs in the MITRE ATT&CK Enterprise Matrix. It comprises 140 tasks (40 direct malicious tasks, 74 TTP-based malicious tasks, and 26 end-to-end kill chains) and systematically evaluates CUAs under realistic enterprise OS security threats in a multi-host sandbox environment using hard-coded evaluation. We evaluate five existing mainstream CUAs, including ReAct, AutoGPT, Gemini CLI, Cursor CLI, and Cursor IDE, based on 8 foundation LLMs. The results demonstrate that current frontier CUAs do not adequately cover OS security-centric threats. These capabilities of CUAs reduce dependence on custom malware and deep domain expertise, enabling even inexperienced attackers to mount complex enterprise intrusions, which raises societal concern about the responsibility and security of CUAs.