DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos

Abstract

Being able to simulate the outcomes of actions in varied environments would revolutionize the development of generalist agents at scale. However, modeling these world dynamics, especially for dexterous robotics tasks, poses significant challenges due to limited data coverage and scarce action labels. Toward this goal, we introduce DreamDojo, a foundation world model that learns diverse interactions and dexterous controls from 44k hours of egocentric human videos. Our data mixture represents the largest video dataset to date for world model pretraining, spanning a wide range of daily scenarios with diverse objects and skills. To address the scarcity of action labels, we introduce continuous latent actions as unified proxy actions, enhancing interaction knowledge transfer from unlabeled videos. After post-training on small-scale target robot data, DreamDojo demonstrates a strong understanding of physics and precise action controllability. We also devise a distillation pipeline that accelerates DreamDojo to a real-time speed of 10.81 FPS and further improves context consistency. Our work enables several important applications based on generative world models, including live teleoperation, policy evaluation, and model-based planning. Systematic evaluation on multiple challenging out-of-distribution (OOD) benchmarks verifies the significance of our method for simulating open-world, contact-rich tasks, paving the way for general-purpose robot world models.
