MolmoAct2: Action Reasoning Models for Real-world Deployment

Abstract

Vision-Language-Action (VLA) models aim to provide a single generalist controller for robots, but today's systems fall short on the criteria that matter for real-world deployment. Frontier models are closed, open-weight alternatives are tied to expensive hardware, reasoning-augmented policies pay prohibitive latency for their grounding, and fine-tuned success rates remain below the threshold for dependable use. We present MolmoAct2, a fully open action reasoning model built for practical deployment, which advances its predecessor along five axes. First, we introduce MolmoER, a VLM backbone specialized for spatial and embodied reasoning, trained on a 3.3M-sample corpus with a specialize-then-rehearse recipe. Second, we release three new datasets spanning low- to medium-cost platforms: MolmoAct2-BimanualYAM, 720 hours of teleoperated bimanual trajectories that constitute the largest open bimanual dataset to date, together with quality-filtered Franka (DROID) and SO100/101 subsets. Third, we provide OpenFAST, an open-weight, open-data action tokenizer trained on millions of trajectories across five embodiments. Fourth, we redesign the architecture to graft a flow-matching continuous-action expert onto a discrete-token VLM via per-layer KV-cache conditioning. Finally, we propose MolmoThink, an adaptive-depth reasoning variant that re-predicts depth tokens only for scene regions that change between timesteps, retaining geometric grounding at a fraction of the prior latency. In the most extensive empirical study of any open VLA to date, spanning 7 simulation and real-world benchmarks, MolmoAct2 outperforms strong baselines including Pi-05, while MolmoER surpasses GPT-5 and Gemini Robotics ER-1.5 across 13 embodied-reasoning benchmarks. We release model weights, training code, and complete training data. Project page: https://allenai.org/blog/molmoact2
