

CUA-Suite: Massive Human-annotated Video Demonstrations for Computer-Use Agents

March 25, 2026
Authors: Xiangru Jian, Shravan Nayak, Kevin Qinghong Lin, Aarash Feizi, Kaixin Li, Patrice Bechard, Spandana Gella, Sai Rajeswar
cs.AI

Abstract

Computer-use agents (CUAs) hold great promise for automating complex desktop workflows, yet progress toward general-purpose agents is bottlenecked by the scarcity of continuous, high-quality human demonstration videos. Recent work emphasizes that continuous video, not sparse screenshots, is the critical missing ingredient for scaling these agents. However, the largest existing open dataset, ScaleCUA, contains only 2 million screenshots, equating to less than 20 hours of video. To address this bottleneck, we introduce CUA-Suite, a large-scale ecosystem of expert video demonstrations and dense annotations for professional desktop computer-use agents. At its core is VideoCUA, which provides approximately 10,000 human-demonstrated tasks across 87 diverse applications with continuous 30 fps screen recordings, kinematic cursor traces, and multi-layered reasoning annotations, totaling approximately 55 hours and 6 million frames of expert video. Unlike sparse datasets that capture only final click coordinates, these continuous video streams preserve the full temporal dynamics of human interaction, forming a superset of information that can be losslessly transformed into the formats required by existing agent frameworks. CUA-Suite further provides two complementary resources: UI-Vision, a rigorous benchmark for evaluating grounding and planning capabilities in CUAs, and GroundCUA, a large-scale grounding dataset with 56K annotated screenshots and over 3.6 million UI element annotations. Preliminary evaluation reveals that current foundation action models struggle substantially with professional desktop applications (~60% task failure rate). Beyond evaluation, CUA-Suite's rich multimodal corpus supports emerging research directions including generalist screen parsing, continuous spatial control, video-based reward modeling, and visual world models. All data and models are publicly released.
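The abstract's claim that continuous recordings are a superset of screenshot-style datasets can be illustrated with a minimal sketch: a dense 30 fps cursor trace can always be projected down to the sparse (timestamp, x, y) click events that screenshot-based agent frameworks consume, while the reverse reconstruction is impossible. The `CursorSample` schema and `to_sparse_clicks` helper below are hypothetical illustrations, not the dataset's actual API.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class CursorSample:
    t: float       # timestamp in seconds
    x: int         # cursor x position in pixels
    y: int         # cursor y position in pixels
    clicked: bool  # whether a mouse-down occurred at this sample

def to_sparse_clicks(trace: List[CursorSample]) -> List[Tuple[float, int, int]]:
    """Project a dense per-frame cursor trace onto the sparse click events
    used by screenshot-only datasets -- a lossy, one-way transformation."""
    return [(s.t, s.x, s.y) for s in trace if s.clicked]

# Toy trace: four samples at ~30 fps, containing a single click.
trace = [
    CursorSample(0.000, 100, 200, False),
    CursorSample(0.033, 120, 210, False),
    CursorSample(0.066, 140, 215, True),
    CursorSample(0.100, 140, 215, False),
]
print(to_sparse_clicks(trace))  # [(0.066, 140, 215)]
```

The kinematic motion between samples (velocity, hesitation, hover behavior) survives only in the dense trace, which is why the paper argues continuous video is the richer training signal.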