Code Agent can be an End-to-end System Hacker: Benchmarking Real-world Threats of Computer-use Agent

Abstract

Computer-use agent (CUA) frameworks, powered by large language models (LLMs) or multimodal LLMs (MLLMs), are rapidly maturing as assistants that can perceive context, reason, and act directly within software environments. Among their most critical applications is operating system (OS) control. As CUAs in the OS domain become increasingly embedded in daily operations, it is imperative to examine their real-world security implications, specifically whether CUAs can be misused to perform realistic, security-relevant attacks. Existing works exhibit four major limitations: a missing attacker-knowledge model of tactics, techniques, and procedures (TTPs); incomplete coverage of end-to-end kill chains; unrealistic environments that lack multiple hosts and encrypted user credentials; and unreliable judgment that depends on LLM-as-a-Judge. To address these gaps, we propose AdvCUA, the first benchmark aligned with real-world TTPs in the MITRE ATT&CK Enterprise Matrix. AdvCUA comprises 140 tasks (40 direct malicious tasks, 74 TTP-based malicious tasks, and 26 end-to-end kill chains) and systematically evaluates CUAs under realistic enterprise OS security threats in a multi-host sandbox environment using hard-coded evaluation. We evaluate five existing mainstream CUAs (ReAct, AutoGPT, Gemini CLI, Cursor CLI, and Cursor IDE) built on 8 foundation LLMs. The results demonstrate that current frontier CUAs do not adequately cover OS security-centric threats. These capabilities of CUAs reduce dependence on custom malware and deep domain expertise, enabling even inexperienced attackers to mount complex enterprise intrusions, which raises societal concern about the responsibility and security of CUAs.
