

MolmoAct: Action Reasoning Models that can Reason in Space

August 11, 2025
Authors: Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, Winson Han, Wilbert Pumacay, Angelica Wu, Rose Hendrix, Karen Farley, Eli VanderBilt, Ali Farhadi, Dieter Fox, Ranjay Krishna
cs.AI

Abstract

Reasoning is central to purposeful action, yet most robotic foundation models map perception and instructions directly to control, which limits adaptability, generalization, and semantic grounding. We introduce Action Reasoning Models (ARMs), a class of vision-language-action models that integrate perception, planning, and control through a structured three-stage pipeline. Our model, MolmoAct, encodes observations and instructions into depth-aware perception tokens, generates mid-level spatial plans as editable trajectory traces, and predicts precise low-level actions, enabling explainable and steerable behavior. MolmoAct-7B-D achieves strong performance across simulation and real-world settings: 70.5% zero-shot accuracy on SimplerEnv Visual Matching tasks, surpassing closed-source Pi-0 and GR00T N1; 86.6% average success on LIBERO, including an additional 6.3% gain over ThinkAct on long-horizon tasks; and in real-world fine-tuning, an additional 10% (single-arm) and an additional 22.7% (bimanual) task progression over Pi-0-FAST. It also outperforms baselines by an additional 23.3% on out-of-distribution generalization and achieves top human-preference scores for open-ended instruction following and trajectory steering. Furthermore, we release, for the first time, the MolmoAct Dataset -- a mid-training robot dataset comprising over 10,000 high quality robot trajectories across diverse scenarios and tasks. Training with this dataset yields an average 5.5% improvement in general performance over the base model. We release all model weights, training code, our collected dataset, and our action reasoning dataset, establishing MolmoAct as both a state-of-the-art robotics foundation model and an open blueprint for building ARMs that transform perception into purposeful action through structured reasoning. Blogpost: https://allenai.org/blog/molmoact
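The three-stage pipeline described above (depth-aware perception tokens, an editable mid-level trajectory trace, then low-level actions) can be illustrated in code. The following is a minimal, hypothetical sketch of that control flow only; the class and method names (ToyARM, encode_perception, plan_trajectory, predict_actions) are illustrative assumptions and do not reflect the released MolmoAct API or weights.

```python
"""Toy sketch of an Action Reasoning Model (ARM) three-stage rollout.

Stage 1: encode observation + instruction into depth-aware perception tokens.
Stage 2: generate a mid-level spatial plan as an editable trajectory trace.
Stage 3: predict low-level actions conditioned on the (possibly edited) trace.
All components below are stand-ins, not the real MolmoAct implementation.
"""
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ARMOutput:
    perception_tokens: List[int]                  # stage 1 output
    trajectory_trace: List[Tuple[float, float]]   # stage 2: editable 2D waypoints
    actions: List[List[float]]                    # stage 3: low-level commands


class ToyARM:
    """Stand-in for a vision-language-action backbone; each stage conditions on the previous one."""

    def encode_perception(self, image, instruction: str) -> List[int]:
        # A real model would fuse RGB and estimated depth into discrete tokens.
        return [hash((instruction, i)) % 1000 for i in range(4)]

    def plan_trajectory(self, tokens: List[int]) -> List[Tuple[float, float]]:
        # Mid-level plan as a short image-space trace a user can inspect or edit.
        return [(0.1 * i, 0.001 * t) for i, t in enumerate(tokens)]

    def predict_actions(self, tokens: List[int],
                        trace: List[Tuple[float, float]]) -> List[List[float]]:
        # Low-level actions conditioned on perception tokens and the trace.
        return [[x, y, 0.0] for x, y in trace]


def run_arm(model: ToyARM, image, instruction: str) -> ARMOutput:
    tokens = model.encode_perception(image, instruction)
    trace = model.plan_trajectory(tokens)
    actions = model.predict_actions(tokens, trace)
    return ARMOutput(tokens, trace, actions)


if __name__ == "__main__":
    out = run_arm(ToyARM(), image=None, instruction="pick up the red mug")
    print(out.trajectory_trace)
```

The separation into three explicit stages is what the paper credits for explainability and steerability: the trajectory trace is exposed as an intermediate artifact that can be inspected or edited before actions are decoded.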