MolmoAct2: Action Reasoning Models for Real-world Deployment

May 4, 2026
作者: Haoquan Fang, Jiafei Duan, Donovan Clay, Sam Wang, Shuo Liu, Weikai Huang, Xiang Fan, Wei-Chuan Tsai, Shirui Chen, Yi Ru Wang, Shanli Xing, Jaemin Cho, Jae Sung Park, Ainaz Eftekhar, Peter Sushko, Karen Farley, Angad Wadhwa, Cole Harrison, Winson Han, Ying-Chun Lee, Eli VanderBilt, Rose Hendrix, Suveen Ellawela, Lucas Ngoo, Joyce Chai, Zhongzheng Ren, Ali Farhadi, Dieter Fox, Ranjay Krishna
cs.AI

Abstract

Vision-Language-Action (VLA) models aim to provide a single generalist controller for robots, but today's systems fall short on the criteria that matter for real-world deployment. Frontier models are closed, open-weight alternatives are tied to expensive hardware, reasoning-augmented policies pay prohibitive latency for their grounding, and fine-tuned success rates remain below the threshold for dependable use. We present MolmoAct2, a fully open action reasoning model built for practical deployment, advancing its predecessor along five axes. We introduce MolmoER, a VLM backbone specialized for spatial and embodied reasoning, trained on a 3.3M-sample corpus with a specialize-then-rehearse recipe. We release three new datasets spanning low-to-medium cost platforms, including MolmoAct2-BimanualYAM, 720 hours of teleoperated bimanual trajectories that constitute the largest open bimanual dataset to date, together with quality-filtered Franka (DROID) and SO100/101 subsets. We provide OpenFAST, an open-weight, open-data action tokenizer trained on millions of trajectories across five embodiments. We redesign the architecture to graft a flow-matching continuous-action expert onto a discrete-token VLM via per-layer KV-cache conditioning. Finally, we propose MolmoThink, an adaptive-depth reasoning variant that re-predicts depth tokens only for scene regions that change between timesteps, retaining geometric grounding at a fraction of prior latency. In the most extensive empirical study of any open VLA to date, spanning 7 simulation and real-world benchmarks, MolmoAct2 outperforms strong baselines including Pi-05, while MolmoER surpasses GPT-5 and Gemini Robotics ER-1.5 across 13 embodied-reasoning benchmarks. We release model weights, training code, and complete training data. Project page: https://allenai.org/blog/molmoact2
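The abstract presents OpenFAST as an open-weight, open-data action tokenizer; its name points at the FAST family, which tokenizes continuous action chunks via DCT compression, quantization, and byte-pair encoding. Below is a minimal sketch of that FAST-style pipeline, assuming OpenFAST follows the published FAST recipe; `encode_chunk`, `decode_chunk`, and the `scale` constant are illustrative, not the OpenFAST API.

```python
import numpy as np
from scipy.fft import dct, idct

def encode_chunk(actions, scale=10.0):
    """actions: (horizon, action_dim) chunk of continuous robot actions."""
    # Frequency-space view: smooth trajectories concentrate energy in a
    # few low-frequency DCT coefficients, so they compress well.
    coeffs = dct(actions, axis=0, norm="ortho")
    # FAST further compresses these integer streams with BPE so common
    # coefficient patterns become single vocabulary tokens; omitted here.
    return np.round(coeffs * scale).astype(np.int32).flatten()

def decode_chunk(tokens, horizon, action_dim, scale=10.0):
    """Invert quantization and the DCT to recover an action chunk."""
    coeffs = tokens.reshape(horizon, action_dim).astype(np.float32) / scale
    return idct(coeffs, axis=0, norm="ortho")
```

The `scale` factor trades reconstruction fidelity against token-stream entropy; the BPE stage, skipped in this sketch, is where most of the sequence-length savings come from.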
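The abstract also says the architecture grafts a flow-matching continuous-action expert onto a discrete-token VLM via per-layer KV-cache conditioning. A minimal PyTorch sketch of one such expert layer, under the assumption that each expert layer attends into the matching VLM layer's cached context (names like `ConditionedExpertLayer` are illustrative, not the MolmoAct2 release):

```python
import torch
import torch.nn as nn

class ConditionedExpertLayer(nn.Module):
    """One layer of a flow-matching action expert, conditioned on the VLM.

    In the real design the VLM's per-layer KV cache would be reused
    directly; re-projecting its hidden states here keeps the sketch short.
    """

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                nn.Linear(4 * dim, dim))
        self.n1 = nn.LayerNorm(dim)
        self.n2 = nn.LayerNorm(dim)
        self.n3 = nn.LayerNorm(dim)

    def forward(self, a, vlm_h):
        # a: noisy action tokens (B, T, dim); vlm_h: this layer's VLM states.
        h = self.n1(a)
        a = a + self.self_attn(h, h, h, need_weights=False)[0]
        # Per-layer conditioning: the expert reads the VLM's context at the
        # same depth, so the VLM prefix is computed once and reused.
        h = self.n2(a)
        a = a + self.cross_attn(h, vlm_h, vlm_h, need_weights=False)[0]
        return a + self.ff(self.n3(a))
```

The appeal of this split is that the discrete-token VLM encodes the observation and instruction once per step, while only the small expert runs repeatedly inside the flow-matching denoising loop.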
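Finally, MolmoThink is described as re-predicting depth tokens only for scene regions that change between timesteps. A hypothetical sketch of that update rule, assuming per-patch caching and a simple pixel-difference change test; `depth_head`, the patch size, and the threshold are all assumptions, not the released MolmoThink code.

```python
import torch

def update_depth_tokens(frame_t, frame_prev, cached_tokens,
                        depth_head, patch=16, thresh=0.05):
    """Re-predict depth tokens only for patches that changed between
    timesteps; reuse the cached tokens everywhere else."""
    def to_patches(img):  # img: (C, H, W) -> (N, C, patch, patch)
        c = img.shape[0]
        p = img.unfold(1, patch, patch).unfold(2, patch, patch)
        return p.reshape(c, -1, patch, patch).transpose(0, 1)

    cur, prev = to_patches(frame_t), to_patches(frame_prev)
    # Mean absolute pixel change per patch flags the regions that moved.
    changed = (cur - prev).abs().mean(dim=(1, 2, 3)) > thresh  # (N,)
    tokens = cached_tokens.clone()
    if changed.any():
        # Run the depth head only on changed patches; static background
        # keeps its cached depth tokens, cutting per-step latency.
        tokens[changed] = depth_head(cur[changed])
    return tokens, changed
```

Under this scheme the cost of geometric grounding scales with scene motion rather than image size, which is consistent with the latency claim in the abstract.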