MolmoAct: Action Reasoning Models that can Reason in Space

Abstract

Reasoning is central to purposeful action, yet most robotic foundation models map perception and instructions directly to control, which limits adaptability, generalization, and semantic grounding. We introduce Action Reasoning Models (ARMs), a class of vision-language-action models that integrate perception, planning, and control through a structured three-stage pipeline. Our model, MolmoAct, encodes observations and instructions into depth-aware perception tokens, generates mid-level spatial plans as editable trajectory traces, and predicts precise low-level actions, enabling explainable and steerable behavior. MolmoAct-7B-D performs strongly in both simulation and real-world settings: it reaches 70.5% zero-shot accuracy on SimplerEnv Visual Matching tasks, surpassing the closed-source Pi-0 and GR00T N1; it achieves 86.6% average success on LIBERO, including a 6.3% gain over ThinkAct on long-horizon tasks; and in real-world fine-tuning it improves task progression over Pi-0-FAST by 10% (single-arm) and 22.7% (bimanual). It also outperforms baselines by 23.3% on out-of-distribution generalization and achieves the top human-preference scores for open-ended instruction following and trajectory steering. Furthermore, we release, for the first time, the MolmoAct Dataset, a mid-training robot dataset comprising over 10,000 high-quality robot trajectories across diverse scenarios and tasks. Training with this dataset improves general performance by an average of 5.5% over the base model. We release all model weights, training code, our collected dataset, and our action reasoning dataset, establishing MolmoAct as both a state-of-the-art robotics foundation model and an open blueprint for building ARMs that transform perception into purposeful action through structured reasoning. Blogpost: https://allenai.org/blog/molmoact
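
To make the data flow of the three-stage pipeline concrete, the following is a minimal toy sketch and not MolmoAct's implementation. Every name in it (`perceive`, `plan_trace`, `trace_to_actions`), the quantization rule, the straight-line planner, and the displacement-based actions are assumptions made purely for illustration. It only shows the shape of the idea: a perception step yields discrete tokens, a planning step yields a trace of waypoints, and because the low-level actions are derived from the trace, editing a waypoint changes the resulting behavior.

```python
from typing import List, Tuple

# Hypothetical type: a 2D waypoint in image-plane coordinates.
Waypoint = Tuple[float, float]


def perceive(observation: List[float], instruction: str) -> List[int]:
    """Stage 1 stand-in: quantize values in [0, 1) into tokens 0..9.

    A real model would encode images and text; the instruction is unused here.
    """
    return [min(int(v * 10), 9) for v in observation]


def plan_trace(tokens: List[int], start: Waypoint, goal: Waypoint,
               steps: int = 4) -> List[Waypoint]:
    """Stage 2 stand-in: straight-line waypoints from start to goal.

    `tokens` is accepted only to show the data flow; this toy planner ignores it.
    """
    return [
        (start[0] + (goal[0] - start[0]) * i / steps,
         start[1] + (goal[1] - start[1]) * i / steps)
        for i in range(steps + 1)
    ]


def trace_to_actions(trace: List[Waypoint]) -> List[Tuple[float, float]]:
    """Stage 3 stand-in: one displacement per pair of consecutive waypoints."""
    return [(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(trace, trace[1:])]


if __name__ == "__main__":
    tokens = perceive([0.05, 0.5, 0.99], "pick up the cup")
    trace = plan_trace(tokens, start=(0.0, 0.0), goal=(4.0, 8.0))
    print(trace_to_actions(trace))

    # Steering: a user edits one waypoint, and the actions follow the new trace.
    trace[2] = (2.0, 5.0)
    print(trace_to_actions(trace))
```

In this toy, editing the middle waypoint changes only the two actions adjacent to it and leaves the endpoint untouched, which is one simple way to picture why an explicit intermediate trace makes behavior both inspectable and steerable.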

