CUA-Suite: Massive Human-annotated Video Demonstrations for Computer-Use Agents

Abstract

Computer-use agents (CUAs) hold great promise for automating complex desktop workflows, yet progress toward general-purpose agents is bottlenecked by the scarcity of continuous, high-quality human demonstration videos. Recent work emphasizes that continuous video, not sparse screenshots, is the critical missing ingredient for scaling these agents. However, the largest existing open dataset, ScaleCUA, contains only 2 million screenshots, equating to less than 20 hours of video. To address this bottleneck, we introduce CUA-Suite, a large-scale ecosystem of expert video demonstrations and dense annotations for professional desktop computer-use agents. At its core is VideoCUA, which provides approximately 10,000 human-demonstrated tasks across 87 diverse applications with continuous 30 fps screen recordings, kinematic cursor traces, and multi-layered reasoning annotations, totaling approximately 55 hours and 6 million frames of expert video. Unlike sparse datasets that capture only final click coordinates, these continuous video streams preserve the full temporal dynamics of human interaction, forming a superset of information that can be losslessly transformed into the formats required by existing agent frameworks. CUA-Suite further provides two complementary resources: UI-Vision, a rigorous benchmark for evaluating grounding and planning capabilities in CUAs, and GroundCUA, a large-scale grounding dataset with 56K annotated screenshots and over 3.6 million UI element annotations. Preliminary evaluation reveals that current foundation action models struggle substantially with professional desktop applications (~60% task failure rate). Beyond evaluation, CUA-Suite's rich multimodal corpus supports emerging research directions including generalist screen parsing, continuous spatial control, video-based reward modeling, and visual world models. All data and models are publicly released.
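
The duration and frame-count figures quoted above can be sanity-checked with a few lines of arithmetic. The sketch below assumes that each ScaleCUA screenshot is counted as one frame of 30 fps video, which is our reading of "equating to" and is not stated explicitly; the helper names `frames_to_hours` and `hours_to_frames` are illustrative and do not come from the CUA-Suite release.

```python
# Assumption: one screenshot is treated as one frame of 30 fps video.
FPS = 30  # frame rate stated for the VideoCUA screen recordings


def frames_to_hours(frames: int, fps: int = FPS) -> float:
    """Convert a frame count into hours of continuous video."""
    return frames / fps / 3600


def hours_to_frames(hours: float, fps: int = FPS) -> int:
    """Convert hours of continuous video into a frame count."""
    return round(hours * 3600 * fps)


# ScaleCUA: 2 million screenshots, about 18.5 hours (under 20 hours)
scalecua_hours = frames_to_hours(2_000_000)

# VideoCUA: about 55 hours, 5,940,000 frames (roughly 6 million)
videocua_frames = hours_to_frames(55)
```

Under this assumption both figures are consistent with a 30 fps recording rate: 2 million frames is a little over 18 hours, and 55 hours is just under 6 million frames.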
