MolmoB0T: Large-Scale Simulation Enables Zero-Shot Manipulation
March 17, 2026
Authors: Abhay Deshpande, Maya Guru, Rose Hendrix, Snehal Jauhri, Ainaz Eftekhar, Rohun Tripathi, Max Argus, Jordi Salvador, Haoquan Fang, Matthew Wallingford, Wilbert Pumacay, Yejin Kim, Quinn Pfeifer, Ying-Chun Lee, Piper Wolters, Omar Rayyan, Mingtong Zhang, Jiafei Duan, Karen Farley, Winson Han, Eli Vanderbilt, Dieter Fox, Ali Farhadi, Georgia Chalvatzaki, Dhruv Shah, Ranjay Krishna
cs.AI
Abstract
A prevailing view in robot learning is that simulation alone is not enough; effective sim-to-real transfer is widely believed to require at least some real-world data collection or task-specific fine-tuning to bridge the gap between simulated and physical environments. We challenge that assumption. With sufficiently large-scale and diverse simulated synthetic training data, we show that zero-shot transfer to the real world is not only possible but effective for both static and mobile manipulation. We introduce MolmoBot-Engine, a fully open-source pipeline for procedural data generation across robots, tasks, and diverse simulated environments in MolmoSpaces. With it, we release MolmoBot-Data, a dataset of 1.8 million expert trajectories for articulated object manipulation and pick-and-place tasks. We train three policy classes: MolmoBot, a Molmo2-based multi-frame vision-language model with a flow-matching action head; MolmoBot-Pi0, which replicates the π_0 architecture to enable direct comparison; and MolmoBot-SPOC, a lightweight policy suitable for edge deployment and amenable to RL fine-tuning. We evaluate on two robotic platforms: the Franka FR3 for tabletop manipulation tasks and the Rainbow Robotics RB-Y1 mobile manipulator for door opening, drawer manipulation, cabinet interaction, and mobile pick-and-place. Without any real-world fine-tuning, our policies achieve zero-shot transfer to unseen objects and environments. On tabletop pick-and-place, MolmoBot achieves a success rate of 79.2% in real-world evaluations across 4 settings, outperforming π_{0.5} at 39.2%. Our results demonstrate that procedural environment generation combined with diverse articulated assets can produce robust manipulation policies that generalize broadly to the real world. Technical Blog: https://allenai.org/blog/molmobot-robot-manipulation
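To give a rough intuition for the flow-matching action head mentioned above: at inference time such a head starts from Gaussian noise and integrates a learned velocity field toward an action vector. The sketch below is a minimal, self-contained illustration of that sampling loop only, with the network replaced by the closed-form rectified-flow velocity toward a single known target; all names (`velocity`, `sample_action`, `target_action`, the 7-dim action) are hypothetical and not taken from the MolmoBot codebase.

```python
import random

def velocity(x, t, target):
    # Closed-form rectified-flow velocity toward `target`:
    # v(x, t) = (target - x) / (1 - t).
    # A real policy would instead predict v with a network conditioned
    # on vision-language features and the robot's state (assumption).
    return [(a - xi) / (1.0 - t) for a, xi in zip(target, x)]

def sample_action(target, steps=50, seed=0):
    rng = random.Random(seed)
    # Start the flow from Gaussian noise of the action dimension.
    x = [rng.gauss(0.0, 1.0) for _ in target]
    dt = 1.0 / steps
    # Plain Euler integration of dx/dt = v(x, t) from t=0 to t=1.
    for i in range(steps):
        t = i * dt
        v = velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Illustrative 7-DoF action (e.g. end-effector pose + gripper).
target_action = [0.1, -0.2, 0.3, 0.0, 0.5, -0.1, 0.7]
action = sample_action(target_action)
print(max(abs(a - b) for a, b in zip(action, target_action)))  # ~0.0
```

Because the velocity here points exactly at the target, the Euler integration recovers it; training a network to approximate such velocities from noisy interpolants is what makes the head usable on unseen observations.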