MolmoAct2: Action Reasoning Models for Real-world Deployment

May 4, 2026
作者: Haoquan Fang, Jiafei Duan, Donovan Clay, Sam Wang, Shuo Liu, Weikai Huang, Xiang Fan, Wei-Chuan Tsai, Shirui Chen, Yi Ru Wang, Shanli Xing, Jaemin Cho, Jae Sung Park, Ainaz Eftekhar, Peter Sushko, Karen Farley, Angad Wadhwa, Cole Harrison, Winson Han, Ying-Chun Lee, Eli VanderBilt, Rose Hendrix, Suveen Ellawela, Lucas Ngoo, Joyce Chai, Zhongzheng Ren, Ali Farhadi, Dieter Fox, Ranjay Krishna
cs.AI

Abstract

Vision-Language-Action (VLA) models aim to provide a single generalist controller for robots, but today's systems fall short on the criteria that matter for real-world deployment. Frontier models are closed, open-weight alternatives are tied to expensive hardware, reasoning-augmented policies pay prohibitive latency for their grounding, and fine-tuned success rates remain below the threshold for dependable use. We present MolmoAct2, a fully open action reasoning model built for practical deployment, advancing its predecessor along five axes. We introduce MolmoER, a VLM backbone specialized for spatial and embodied reasoning, trained on a 3.3M-sample corpus with a specialize-then-rehearse recipe. We release three new datasets spanning low-to-medium cost platforms, including MolmoAct2-BimanualYAM, 720 hours of teleoperated bimanual trajectories that constitute the largest open bimanual dataset to date, together with quality-filtered Franka (DROID) and SO100/101 subsets. We provide OpenFAST, an open-weight, open-data action tokenizer trained on millions of trajectories across five embodiments. We redesign the architecture to graft a flow-matching continuous-action expert onto a discrete-token VLM via per-layer KV-cache conditioning. Finally, we propose MolmoThink, an adaptive-depth reasoning variant that re-predicts depth tokens only for scene regions that change between timesteps, retaining geometric grounding at a fraction of prior latency. In the most extensive empirical study of any open VLA to date, spanning 7 simulation and real-world benchmarks, MolmoAct2 outperforms strong baselines including Pi-05, while MolmoER surpasses GPT-5 and Gemini Robotics ER-1.5 across 13 embodied-reasoning benchmarks. We release model weights, training code, and complete training data. Project page: https://allenai.org/blog/molmoact2
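The abstract presents OpenFAST as an open-weight, open-data action tokenizer; its name points at the FAST family, which tokenizes continuous action chunks via DCT compression, quantization, and byte-pair encoding. Below is a minimal sketch of that FAST-style pipeline, assuming OpenFAST follows the published FAST recipe; `encode_chunk`, `decode_chunk`, and the `scale` constant are illustrative, not the OpenFAST API.

```python
import numpy as np
from scipy.fft import dct, idct

def encode_chunk(actions, scale=10.0):
    """actions: (horizon, action_dim) chunk of continuous robot actions."""
    # Frequency-space view: smooth trajectories concentrate energy in a
    # few low-frequency DCT coefficients, so they compress well.
    coeffs = dct(actions, axis=0, norm="ortho")
    # FAST further compresses these integer streams with BPE so common
    # coefficient patterns become single vocabulary tokens; omitted here.
    return np.round(coeffs * scale).astype(np.int32).flatten()

def decode_chunk(tokens, horizon, action_dim, scale=10.0):
    """Invert quantization and the DCT to recover an action chunk."""
    coeffs = tokens.reshape(horizon, action_dim).astype(np.float32) / scale
    return idct(coeffs, axis=0, norm="ortho")
```

The `scale` factor trades reconstruction fidelity against token-stream entropy; the BPE stage, skipped in this sketch, is where most of the sequence-length savings come from.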
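The abstract also says the architecture grafts a flow-matching continuous-action expert onto a discrete-token VLM via per-layer KV-cache conditioning. A minimal PyTorch sketch of one such expert layer, under the assumption that each expert layer attends into the matching VLM layer's cached context (names like `ConditionedExpertLayer` are illustrative, not the MolmoAct2 release):

```python
import torch
import torch.nn as nn

class ConditionedExpertLayer(nn.Module):
    """One layer of a flow-matching action expert, conditioned on the VLM.

    In the real design the VLM's per-layer KV cache would be reused
    directly; re-projecting its hidden states here keeps the sketch short.
    """

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                nn.Linear(4 * dim, dim))
        self.n1 = nn.LayerNorm(dim)
        self.n2 = nn.LayerNorm(dim)
        self.n3 = nn.LayerNorm(dim)

    def forward(self, a, vlm_h):
        # a: noisy action tokens (B, T, dim); vlm_h: this layer's VLM states.
        h = self.n1(a)
        a = a + self.self_attn(h, h, h, need_weights=False)[0]
        # Per-layer conditioning: the expert reads the VLM's context at the
        # same depth, so the VLM prefix is computed once and reused.
        h = self.n2(a)
        a = a + self.cross_attn(h, vlm_h, vlm_h, need_weights=False)[0]
        return a + self.ff(self.n3(a))
```

The appeal of this split is that the discrete-token VLM encodes the observation and instruction once per step, while only the small expert runs repeatedly inside the flow-matching denoising loop.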
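Finally, MolmoThink is described as re-predicting depth tokens only for scene regions that change between timesteps. A hypothetical sketch of that update rule, assuming per-patch caching and a simple pixel-difference change test; `depth_head`, the patch size, and the threshold are all assumptions, not the released MolmoThink code.

```python
import torch

def update_depth_tokens(frame_t, frame_prev, cached_tokens,
                        depth_head, patch=16, thresh=0.05):
    """Re-predict depth tokens only for patches that changed between
    timesteps; reuse the cached tokens everywhere else."""
    def to_patches(img):  # img: (C, H, W) -> (N, C, patch, patch)
        c = img.shape[0]
        p = img.unfold(1, patch, patch).unfold(2, patch, patch)
        return p.reshape(c, -1, patch, patch).transpose(0, 1)

    cur, prev = to_patches(frame_t), to_patches(frame_prev)
    # Mean absolute pixel change per patch flags the regions that moved.
    changed = (cur - prev).abs().mean(dim=(1, 2, 3)) > thresh  # (N,)
    tokens = cached_tokens.clone()
    if changed.any():
        # Run the depth head only on changed patches; static background
        # keeps its cached depth tokens, cutting per-step latency.
        tokens[changed] = depth_head(cur[changed])
    return tokens, changed
```

Under this scheme the cost of geometric grounding scales with scene motion rather than image size, which is consistent with the latency claim in the abstract.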