DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos
February 6, 2026
作者: Shenyuan Gao, William Liang, Kaiyuan Zheng, Ayaan Malik, Seonghyeon Ye, Sihyun Yu, Wei-Cheng Tseng, Yuzhu Dong, Kaichun Mo, Chen-Hsuan Lin, Qianli Ma, Seungjun Nah, Loic Magne, Jiannan Xiang, Yuqi Xie, Ruijie Zheng, Dantong Niu, You Liang Tan, K. R. Zentner, George Kurian, Suneel Indupuru, Pooya Jannaty, Jinwei Gu, Jun Zhang, Jitendra Malik, Pieter Abbeel, Ming-Yu Liu, Yuke Zhu, Joel Jang, Linxi "Jim" Fan
cs.AI
Abstract
Being able to simulate the outcomes of actions in varied environments will revolutionize the development of generalist agents at scale. However, modeling these world dynamics, especially for dexterous robotics tasks, poses significant challenges due to limited data coverage and scarce action labels. As an endeavor towards this end, we introduce DreamDojo, a foundation world model that learns diverse interactions and dexterous controls from 44k hours of egocentric human videos. Our data mixture represents the largest video dataset to date for world model pretraining, spanning a wide range of daily scenarios with diverse objects and skills. To address the scarcity of action labels, we introduce continuous latent actions as unified proxy actions, enhancing interaction knowledge transfer from unlabeled videos. After post-training on small-scale target robot data, DreamDojo demonstrates a strong understanding of physics and precise action controllability. We also devise a distillation pipeline that accelerates DreamDojo to a real-time speed of 10.81 FPS and further improves context consistency. Our work enables several important applications based on generative world models, including live teleoperation, policy evaluation, and model-based planning. Systematic evaluation on multiple challenging out-of-distribution (OOD) benchmarks verifies the significance of our method for simulating open-world, contact-rich tasks, paving the way for general-purpose robot world models.