MolmoAct: Action Reasoning Models that can Reason in Space
August 11, 2025
Authors: Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, Winson Han, Wilbert Pumacay, Angelica Wu, Rose Hendrix, Karen Farley, Eli VanderBilt, Ali Farhadi, Dieter Fox, Ranjay Krishna
cs.AI
Abstract
Reasoning is central to purposeful action, yet most robotic foundation models
map perception and instructions directly to control, which limits adaptability,
generalization, and semantic grounding. We introduce Action Reasoning Models
(ARMs), a class of vision-language-action models that integrate perception,
planning, and control through a structured three-stage pipeline. Our model,
MolmoAct, encodes observations and instructions into depth-aware perception
tokens, generates mid-level spatial plans as editable trajectory traces, and
predicts precise low-level actions, enabling explainable and steerable
behavior. MolmoAct-7B-D achieves strong performance across simulation and
real-world settings: 70.5% zero-shot accuracy on SimplerEnv Visual Matching
tasks, surpassing closed-source Pi-0 and GR00T N1; 86.6% average success on
LIBERO, including an additional 6.3% gain over ThinkAct on long-horizon tasks;
and in real-world fine-tuning, an additional 10% (single-arm) and an additional
22.7% (bimanual) task progression over Pi-0-FAST. It also outperforms baselines
by an additional 23.3% on out-of-distribution generalization and achieves top
human-preference scores for open-ended instruction following and trajectory
steering. Furthermore, we release, for the first time, the MolmoAct Dataset --
a mid-training robot dataset comprising over 10,000 high-quality robot
trajectories across diverse scenarios and tasks. Training with this dataset
yields an average 5.5% improvement in general performance over the base model.
We release all model weights, training code, our collected dataset, and our
action reasoning dataset, establishing MolmoAct as both a state-of-the-art
robotics foundation model and an open blueprint for building ARMs that
transform perception into purposeful action through structured reasoning.
Blogpost: https://allenai.org/blog/molmoact
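
To make the three-stage pipeline described above concrete, here is a minimal, heavily simplified sketch of how an Action Reasoning Model's inference flow could be organized: perception tokens, then an editable trajectory trace, then low-level actions. All class and function names (e.g., encode_perception, plan_trajectory, predict_actions, arm_inference) are hypothetical illustrations for this sketch, not the actual MolmoAct API; the placeholder bodies stand in for learned model components.

```python
# Hypothetical sketch of the ARM three-stage pipeline (not the MolmoAct API).
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class PerceptionTokens:
    """Stage 1: depth-aware tokens encoding the observation and instruction."""
    tokens: List[int]


@dataclass
class TrajectoryTrace:
    """Stage 2: mid-level spatial plan as an editable sequence of image-plane waypoints."""
    waypoints: List[Tuple[float, float]]


@dataclass
class ActionChunk:
    """Stage 3: low-level actions (e.g., end-effector deltas) sent to the robot."""
    actions: List[List[float]]


def encode_perception(image, instruction: str) -> PerceptionTokens:
    # Placeholder: a real model would jointly tokenize RGB(-D) input and text.
    return PerceptionTokens(tokens=[0] * 16)


def plan_trajectory(perception: PerceptionTokens) -> TrajectoryTrace:
    # Placeholder: a real model would decode a spatial trace conditioned on the tokens.
    return TrajectoryTrace(waypoints=[(0.1, 0.2), (0.4, 0.5)])


def predict_actions(perception: PerceptionTokens, trace: TrajectoryTrace) -> ActionChunk:
    # Placeholder: a real model would decode precise low-level control from the plan.
    return ActionChunk(actions=[[0.0] * 7 for _ in trace.waypoints])


def arm_inference(image, instruction: str,
                  edited_trace: Optional[TrajectoryTrace] = None) -> ActionChunk:
    """Run the pipeline; supplying a user-edited trace makes the behavior steerable."""
    perception = encode_perception(image, instruction)
    trace = edited_trace if edited_trace is not None else plan_trajectory(perception)
    return predict_actions(perception, trace)


if __name__ == "__main__":
    chunk = arm_inference(image=None, instruction="pick up the red mug")
    print(len(chunk.actions), "low-level actions predicted")
```

The point of the sketch is the interface between stages: because the mid-level plan is an explicit, editable object rather than a hidden activation, a user (or downstream system) can inspect or overwrite it before the final actions are predicted, which is what the abstract means by explainable and steerable behavior.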