CUA-Suite: Massive Human-annotated Video Demonstrations for Computer-Use Agents

Abstract

Computer-use agents (CUAs) hold great promise for automating complex desktop workflows, yet progress toward general-purpose agents is bottlenecked by the scarcity of continuous, high-quality human demonstration videos. Recent work emphasizes that continuous video, not sparse screenshots, is the critical missing ingredient for scaling these agents. However, the largest existing open dataset, ScaleCUA, contains only 2 million screenshots, equivalent to less than 20 hours of video. To address this bottleneck, we introduce CUA-Suite, a large-scale ecosystem of expert video demonstrations and dense annotations for professional desktop computer-use agents. At its core is VideoCUA, which provides approximately 10,000 human-demonstrated tasks across 87 diverse applications with continuous 30 fps screen recordings, kinematic cursor traces, and multi-layered reasoning annotations, totaling approximately 55 hours and 6 million frames of expert video. Unlike sparse datasets that capture only final click coordinates, these continuous video streams preserve the full temporal dynamics of human interaction, forming a superset of information that can be losslessly transformed into the formats required by existing agent frameworks. CUA-Suite further provides two complementary resources: UI-Vision, a rigorous benchmark for evaluating grounding and planning capabilities in CUAs, and GroundCUA, a large-scale grounding dataset with 56K annotated screenshots and over 3.6 million UI element annotations. Preliminary evaluation reveals that current foundation action models struggle substantially with professional desktop applications (approximately 60% task failure rate). Beyond evaluation, CUA-Suite's rich multimodal corpus supports emerging research directions including generalist screen parsing, continuous spatial control, video-based reward modeling, and visual world models. All data and models are publicly released.
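The frame and duration figures quoted in the abstract can be sanity-checked with simple arithmetic. The sketch below is only an illustration: the helper names are hypothetical, and applying the 30 fps rate to the ScaleCUA screenshot count is an assumption, since the abstract does not state how its "less than 20 hours" figure was derived.

```python
# Back-of-the-envelope check of the abstract's figures.
# Assumption: 30 fps is used for both conversions; the abstract only
# states 30 fps explicitly for the VideoCUA recordings.

FPS = 30.0  # frames per second, as stated for VideoCUA


def frames_to_hours(frames: int, fps: float = FPS) -> float:
    """Convert a frame count to hours of video at the given frame rate."""
    return frames / fps / 3600.0


def hours_to_frames(hours: float, fps: float = FPS) -> int:
    """Convert hours of video to a frame count at the given frame rate."""
    return round(hours * 3600.0 * fps)


# 2 million screenshots treated as 30 fps video frames.
scalecua_hours = frames_to_hours(2_000_000)

# Approximately 55 hours of VideoCUA recordings at 30 fps.
videocua_frames = hours_to_frames(55)

print(f"ScaleCUA equivalent: {scalecua_hours:.1f} hours")
print(f"VideoCUA frames: {videocua_frames:,}")
```

Under this assumption, 2 million frames come to roughly 18.5 hours, consistent with "less than 20 hours", and 55 hours come to 5.94 million frames, consistent with "approximately 6 million frames".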
