MolmoB0T: Large-Scale Simulation Enables Zero-Shot Manipulation
March 17, 2026
Authors: Abhay Deshpande, Maya Guru, Rose Hendrix, Snehal Jauhri, Ainaz Eftekhar, Rohun Tripathi, Max Argus, Jordi Salvador, Haoquan Fang, Matthew Wallingford, Wilbert Pumacay, Yejin Kim, Quinn Pfeifer, Ying-Chun Lee, Piper Wolters, Omar Rayyan, Mingtong Zhang, Jiafei Duan, Karen Farley, Winson Han, Eli Vanderbilt, Dieter Fox, Ali Farhadi, Georgia Chalvatzaki, Dhruv Shah, Ranjay Krishna
cs.AI
Abstract
A prevailing view in robot learning is that simulation alone is not enough; effective sim-to-real transfer is widely believed to require at least some real-world data collection or task-specific fine-tuning to bridge the gap between simulated and physical environments. We challenge that assumption. With sufficiently large-scale and diverse simulated synthetic training data, we show that zero-shot transfer to the real world is not only possible but effective for both static and mobile manipulation. We introduce MolmoBot-Engine, a fully open-source pipeline for procedural data generation across robots, tasks, and diverse simulated environments in MolmoSpaces. With it, we release MolmoBot-Data, a dataset of 1.8 million expert trajectories for articulated object manipulation and pick-and-place tasks. We train three policy classes: MolmoBot, a Molmo2-based multi-frame vision-language model with a flow-matching action head; MolmoBot-Pi0, which replicates the π_0 architecture to enable direct comparison; and MolmoBot-SPOC, a lightweight policy suitable for edge deployment and amenable to RL fine-tuning. We evaluate on two robotic platforms: the Franka FR3 for tabletop manipulation tasks and the Rainbow Robotics RB-Y1 mobile manipulator for door opening, drawer manipulation, cabinet interaction, and mobile pick-and-place. Without any real-world fine-tuning, our policies achieve zero-shot transfer to unseen objects and environments. On tabletop pick-and-place, MolmoBot achieves a success rate of 79.2% in real-world evaluations across 4 settings, outperforming π_{0.5} at 39.2%. Our results demonstrate that procedural environment generation combined with diverse articulated assets can produce robust manipulation policies that generalize broadly to the real world. Technical Blog: https://allenai.org/blog/molmobot-robot-manipulation
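To give a rough intuition for the flow-matching action head mentioned above: at inference time such a head starts from Gaussian noise and integrates a learned velocity field toward an action vector. The sketch below is a minimal, self-contained illustration of that sampling loop only, with the network replaced by the closed-form rectified-flow velocity toward a single known target; all names (`velocity`, `sample_action`, `target_action`, the 7-dim action) are hypothetical and not taken from the MolmoBot codebase.

```python
import random

def velocity(x, t, target):
    # Closed-form rectified-flow velocity toward `target`:
    # v(x, t) = (target - x) / (1 - t).
    # A real policy would instead predict v with a network conditioned
    # on vision-language features and the robot's state (assumption).
    return [(a - xi) / (1.0 - t) for a, xi in zip(target, x)]

def sample_action(target, steps=50, seed=0):
    rng = random.Random(seed)
    # Start the flow from Gaussian noise of the action dimension.
    x = [rng.gauss(0.0, 1.0) for _ in target]
    dt = 1.0 / steps
    # Plain Euler integration of dx/dt = v(x, t) from t=0 to t=1.
    for i in range(steps):
        t = i * dt
        v = velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Illustrative 7-DoF action (e.g. end-effector pose + gripper).
target_action = [0.1, -0.2, 0.3, 0.0, 0.5, -0.1, 0.7]
action = sample_action(target_action)
print(max(abs(a - b) for a, b in zip(action, target_action)))  # ~0.0
```

Because the velocity here points exactly at the target, the Euler integration recovers it; training a network to approximate such velocities from noisy interpolants is what makes the head usable on unseen observations.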