
Code Agent can be an End-to-end System Hacker: Benchmarking Real-world Threats of Computer-use Agent

October 8, 2025
Authors: Weidi Luo, Qiming Zhang, Tianyu Lu, Xiaogeng Liu, Bin Hu, Hung-Chun Chiu, Siyuan Ma, Yizhe Zhang, Xusheng Xiao, Yinzhi Cao, Zhen Xiang, Chaowei Xiao
cs.AI

Abstract

Computer-use agent (CUA) frameworks, powered by large language models (LLMs) or multimodal LLMs (MLLMs), are rapidly maturing as assistants that can perceive context, reason, and act directly within software environments. Among their most critical applications is operating system (OS) control. As CUAs in the OS domain become increasingly embedded in daily operations, it is imperative to examine their real-world security implications, specifically whether CUAs can be misused to perform realistic, security-relevant attacks. Existing works exhibit four major limitations: a missing attacker-knowledge model of tactics, techniques, and procedures (TTPs); incomplete coverage of end-to-end kill chains; unrealistic environments without multiple hosts or encrypted user credentials; and unreliable judgment that depends on LLM-as-a-Judge. To address these gaps, we propose AdvCUA, the first benchmark aligned with real-world TTPs in the MITRE ATT&CK Enterprise Matrix. It comprises 140 tasks (40 direct malicious tasks, 74 TTP-based malicious tasks, and 26 end-to-end kill chains) and systematically evaluates CUAs under realistic enterprise OS security threats in a multi-host sandbox environment using hard-coded evaluation. We evaluate five mainstream CUAs (ReAct, AutoGPT, Gemini CLI, Cursor CLI, and Cursor IDE) built on 8 foundation LLMs. The results demonstrate that current frontier CUAs do not adequately cover OS security-centric threats. These capabilities of CUAs reduce dependence on custom malware and deep domain expertise, enabling even inexperienced attackers to mount complex enterprise intrusions, which raises societal concern about the responsibility and security of CUAs.