DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos
February 6, 2026
Authors: Shenyuan Gao, William Liang, Kaiyuan Zheng, Ayaan Malik, Seonghyeon Ye, Sihyun Yu, Wei-Cheng Tseng, Yuzhu Dong, Kaichun Mo, Chen-Hsuan Lin, Qianli Ma, Seungjun Nah, Loic Magne, Jiannan Xiang, Yuqi Xie, Ruijie Zheng, Dantong Niu, You Liang Tan, K. R. Zentner, George Kurian, Suneel Indupuru, Pooya Jannaty, Jinwei Gu, Jun Zhang, Jitendra Malik, Pieter Abbeel, Ming-Yu Liu, Yuke Zhu, Joel Jang, Linxi "Jim" Fan
cs.AI
Abstract
Being able to simulate the outcomes of actions in varied environments would revolutionize the development of generalist agents at scale. However, modeling these world dynamics, especially for dexterous robotics tasks, poses significant challenges due to limited data coverage and scarce action labels. To this end, we introduce DreamDojo, a foundation world model that learns diverse interactions and dexterous control from 44k hours of egocentric human videos. Our data mixture represents the largest video dataset to date for world model pretraining, spanning a wide range of daily scenarios with diverse objects and skills. To address the scarcity of action labels, we introduce continuous latent actions as unified proxy actions, enhancing the transfer of interaction knowledge from unlabeled videos. After post-training on small-scale target robot data, DreamDojo demonstrates a strong understanding of physics and precise action controllability. We also devise a distillation pipeline that accelerates DreamDojo to a real-time speed of 10.81 FPS and further improves context consistency. Our work enables several important applications based on generative world models, including live teleoperation, policy evaluation, and model-based planning. Systematic evaluation on multiple challenging out-of-distribution (OOD) benchmarks verifies the significance of our method for simulating open-world, contact-rich tasks, paving the way for general-purpose robot world models.