CUA-Suite: Massive Human-annotated Video Demonstrations for Computer-Use Agents
March 25, 2026
Authors: Xiangru Jian, Shravan Nayak, Kevin Qinghong Lin, Aarash Feizi, Kaixin Li, Patrice Bechard, Spandana Gella, Sai Rajeswar
cs.AI
Abstract
Computer-use agents (CUAs) hold great promise for automating complex desktop workflows, yet progress toward general-purpose agents is bottlenecked by the scarcity of continuous, high-quality human demonstration videos. Recent work emphasizes that continuous video, not sparse screenshots, is the critical missing ingredient for scaling these agents. However, the largest existing open dataset, ScaleCUA, contains only 2 million screenshots, equating to less than 20 hours of video. To address this bottleneck, we introduce CUA-Suite, a large-scale ecosystem of expert video demonstrations and dense annotations for professional desktop computer-use agents. At its core is VideoCUA, which provides approximately 10,000 human-demonstrated tasks across 87 diverse applications with continuous 30 fps screen recordings, kinematic cursor traces, and multi-layered reasoning annotations, totaling approximately 55 hours and 6 million frames of expert video. Unlike sparse datasets that capture only final click coordinates, these continuous video streams preserve the full temporal dynamics of human interaction, forming a superset of information that can be losslessly transformed into the formats required by existing agent frameworks. CUA-Suite further provides two complementary resources: UI-Vision, a rigorous benchmark for evaluating grounding and planning capabilities in CUAs, and GroundCUA, a large-scale grounding dataset with 56K annotated screenshots and over 3.6 million UI element annotations. Preliminary evaluation reveals that current foundation action models struggle substantially with professional desktop applications (~60% task failure rate). Beyond evaluation, CUA-Suite's rich multimodal corpus supports emerging research directions including generalist screen parsing, continuous spatial control, video-based reward modeling, and visual world models. All data and models are publicly released.
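The claim that continuous streams form a lossless superset of sparse click logs amounts to a one-way projection: a dense 30 fps trace can always be downsampled to the (timestamp, x, y) click events that screenshot-only datasets record, but the reverse reconstruction is impossible. A minimal sketch, using a hypothetical per-frame record layout (not the actual VideoCUA schema):

```python
from dataclasses import dataclass

# Hypothetical per-frame record from a 30 fps demonstration stream.
# Field names are illustrative only, not the released VideoCUA format.
@dataclass
class Frame:
    t: float       # timestamp in seconds
    x: int         # cursor x position in pixels
    y: int         # cursor y position in pixels
    clicked: bool  # mouse button pressed on this frame

def to_sparse_actions(frames):
    """Project a continuous cursor trace down to the sparse
    (timestamp, x, y) click events that screenshot-only datasets
    keep. All intermediate kinematics are discarded, which is why
    the dense recording is a strict information superset."""
    return [(f.t, f.x, f.y) for f in frames if f.clicked]

# One second of footage (30 frames): a drag to the right that
# ends with a single click on the final frame.
trace = [Frame(t=i / 30, x=10 * i, y=200, clicked=(i == 29))
         for i in range(30)]
print(to_sparse_actions(trace))  # only the final click survives
```

The projection drops 29 of the 30 frames here; any training signal about cursor velocity or hesitation lives only in the dense trace.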