CUA-Suite: Massive Human-annotated Video Demonstrations for Computer-Use Agents

Abstract

Computer-use agents (CUAs) hold great promise for automating complex desktop workflows, yet progress toward general-purpose agents is bottlenecked by the scarcity of continuous, high-quality human demonstration videos. Recent work emphasizes that continuous video, not sparse screenshots, is the critical missing ingredient for scaling these agents. However, the largest existing open dataset, ScaleCUA, contains only 2 million screenshots, equating to less than 20 hours of video. To address this bottleneck, we introduce CUA-Suite, a large-scale ecosystem of expert video demonstrations and dense annotations for professional desktop computer-use agents. At its core is VideoCUA, which provides approximately 10,000 human-demonstrated tasks across 87 diverse applications with continuous 30 fps screen recordings, kinematic cursor traces, and multi-layered reasoning annotations, totaling approximately 55 hours and 6 million frames of expert video. Unlike sparse datasets that capture only final click coordinates, these continuous video streams preserve the full temporal dynamics of human interaction, forming a superset of information that can be losslessly transformed into the formats required by existing agent frameworks. CUA-Suite further provides two complementary resources: UI-Vision, a rigorous benchmark for evaluating grounding and planning capabilities in CUAs, and GroundCUA, a large-scale grounding dataset with 56K annotated screenshots and over 3.6 million UI element annotations. Preliminary evaluation reveals that current foundation action models struggle substantially with professional desktop applications (~60% task failure rate). Beyond evaluation, CUA-Suite's rich multimodal corpus supports emerging research directions including generalist screen parsing, continuous spatial control, video-based reward modeling, and visual world models. All data and models are publicly released.
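
The scale figures quoted in the abstract can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: the helper names are hypothetical, and treating ScaleCUA's screenshots as if they were consecutive frames at 30 fps is an assumption about how the "less than 20 hours" equivalence might be obtained, not something the abstract states.

```python
FPS = 30  # screen-recording frame rate stated in the abstract


def frames_for_hours(hours: float, fps: int = FPS) -> int:
    """Number of frames in `hours` of video recorded at `fps`."""
    return round(hours * 3600 * fps)


def hours_for_frames(frames: int, fps: int = FPS) -> float:
    """Hours of video that `frames` frames would span at `fps`."""
    return frames / (fps * 3600)


# VideoCUA: about 55 hours at 30 fps, reported as about 6 million frames.
videocua_frames = frames_for_hours(55)

# ScaleCUA: 2 million screenshots, read here as 30 fps frames (assumption).
scalecua_hours = hours_for_frames(2_000_000)

# GroundCUA: over 3.6 million annotations across 56K screenshots.
annotations_per_screenshot = 3_600_000 / 56_000

print(videocua_frames)                       # 5940000
print(round(scalecua_hours, 1))              # 18.5
print(round(annotations_per_screenshot, 1))  # 64.3
```

Under these assumptions the numbers are mutually consistent: 55 hours at 30 fps gives roughly 5.9 million frames, 2 million frames span about 18.5 hours, and GroundCUA averages roughly 64 annotated UI elements per screenshot.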
