Watch and Learn: Learning to Use Computers from Online Videos

Abstract

Computer use agents (CUAs) need to plan task workflows grounded in diverse, ever-changing applications and environments, but learning is hindered by the scarcity of large-scale, high-quality training data in the target application. Existing datasets are domain-specific, static, and costly to annotate, while current synthetic data generation methods often yield simplistic or misaligned task demonstrations. To address these limitations, we introduce Watch & Learn (W&L), a framework that converts human demonstration videos readily available on the Internet into executable UI trajectories at scale. Instead of directly generating trajectories or relying on ad hoc reasoning heuristics, we cast the problem as an inverse dynamics objective: predicting the user's action from consecutive screen states. This formulation reduces manual engineering, is easier to learn, and generalizes more robustly across applications. Concretely, we develop an inverse dynamics labeling pipeline with task-aware video retrieval, generate over 53k high-quality trajectories from raw web videos, and demonstrate that these trajectories improve CUAs both as in-context demonstrations and as supervised training data. On the challenging OSWorld benchmark, UI trajectories extracted with W&L consistently enhance both general-purpose and state-of-the-art frameworks in-context, and deliver stronger gains for open-source models under supervised training. These results highlight web-scale human demonstration videos as a practical and scalable foundation for advancing CUAs towards real-world deployment.
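
To make the inverse dynamics formulation concrete, the following is a minimal sketch, not the authors' pipeline. It assumes a video is already decoded into a sequence of screen states and shows how labeling each consecutive pair of states yields a (state, action) trajectory. The names `Frame`, `Action`, `toy_inverse_dynamics`, and `label_video` are hypothetical, the pixel-difference rule is only a stand-in for a learned inverse dynamics model, and dropping "noop" transitions is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical simplification: a screen state is a small 2D grid of pixel values.
Frame = Tuple[Tuple[int, ...], ...]


@dataclass(frozen=True)
class Action:
    kind: str    # illustrative action vocabulary: "click" or "noop"
    x: int = -1  # column of the predicted click, -1 if not applicable
    y: int = -1  # row of the predicted click, -1 if not applicable


def toy_inverse_dynamics(before: Frame, after: Frame) -> Action:
    """Stand-in for a learned inverse dynamics model.

    Predicts the action that explains the change from `before` to `after`
    by guessing a click at the first pixel that differs.
    """
    for y, (row_before, row_after) in enumerate(zip(before, after)):
        for x, (p_before, p_after) in enumerate(zip(row_before, row_after)):
            if p_before != p_after:
                return Action("click", x, y)
    return Action("noop")


def label_video(
    frames: Sequence[Frame],
    idm: Callable[[Frame, Frame], Action],
) -> List[Tuple[Frame, Action]]:
    """Label every consecutive frame pair and keep the non-trivial steps."""
    trajectory: List[Tuple[Frame, Action]] = []
    for before, after in zip(frames, frames[1:]):
        action = idm(before, after)
        if action.kind != "noop":  # assumption: skip transitions with no change
            trajectory.append((before, action))
    return trajectory
```

In this sketch the labeling loop never inspects the task description. It only needs pairs of consecutive states, which is why the same loop could be applied to any application once a suitable predictor is plugged in as `idm`.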
