MolmoAct2: Action Reasoning Models for Real-world Deployment
May 4, 2026
Authors: Haoquan Fang, Jiafei Duan, Donovan Clay, Sam Wang, Shuo Liu, Weikai Huang, Xiang Fan, Wei-Chuan Tsai, Shirui Chen, Yi Ru Wang, Shanli Xing, Jaemin Cho, Jae Sung Park, Ainaz Eftekhar, Peter Sushko, Karen Farley, Angad Wadhwa, Cole Harrison, Winson Han, Ying-Chun Lee, Eli VanderBilt, Rose Hendrix, Suveen Ellawela, Lucas Ngoo, Joyce Chai, Zhongzheng Ren, Ali Farhadi, Dieter Fox, Ranjay Krishna
cs.AI
Abstract
Vision-Language-Action (VLA) models aim to provide a single generalist controller for robots, but today's systems fall short on the criteria that matter for real-world deployment. Frontier models are closed, open-weight alternatives are tied to expensive hardware, reasoning-augmented policies pay prohibitive latency for their grounding, and fine-tuned success rates remain below the threshold for dependable use. We present MolmoAct2, a fully open action reasoning model built for practical deployment, advancing its predecessor along five axes. We introduce MolmoER, a VLM backbone specialized for spatial and embodied reasoning, trained on a 3.3M-sample corpus with a specialize-then-rehearse recipe. We release three new datasets spanning low-to-medium cost platforms, including MolmoAct2-BimanualYAM, 720 hours of teleoperated bimanual trajectories that constitute the largest open bimanual dataset to date, together with quality-filtered Franka (DROID) and SO100/101 subsets. We provide OpenFAST, an open-weight, open-data action tokenizer trained on millions of trajectories across five embodiments. We redesign the architecture to graft a flow-matching continuous-action expert onto a discrete-token VLM via per-layer KV-cache conditioning. Finally, we propose MolmoThink, an adaptive-depth reasoning variant that re-predicts depth tokens only for scene regions that change between timesteps, retaining geometric grounding at a fraction of prior latency. In the most extensive empirical study of any open VLA to date, spanning 7 simulation and real-world benchmarks, MolmoAct2 outperforms strong baselines including π0.5, while MolmoER surpasses GPT-5 and Gemini Robotics ER-1.5 across 13 embodied-reasoning benchmarks. We release model weights, training code, and complete training data. Project page: https://allenai.org/blog/molmoact2
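The abstract does not spell out OpenFAST's internals. The sketch below assumes a FAST-style pipeline (a DCT over the action chunk per dimension, scale-and-round quantization, then flattening to an integer stream that a BPE vocabulary would compress); every constant and function name is illustrative, not from the release.

```python
import numpy as np
from scipy.fft import dct, idct

SCALE = 10.0                                   # quantization scale, an assumption

def encode(chunk):                             # chunk: (T, action_dim) floats
    coeffs = dct(chunk, axis=0, norm="ortho")  # time -> frequency per action dim
    return np.round(coeffs * SCALE).astype(np.int64).flatten()

def decode(tokens, T, action_dim):             # inverse of encode, up to rounding
    coeffs = tokens.reshape(T, action_dim).astype(np.float64) / SCALE
    return idct(coeffs, axis=0, norm="ortho")

# Smooth trajectories concentrate energy in low frequencies, so most quantized
# high-frequency coefficients are zero -- the redundancy a BPE stage compresses.
chunk = np.cumsum(0.02 * np.random.randn(16, 7), axis=0)
tokens = encode(chunk)
recon = decode(tokens, 16, 7)
print("max reconstruction error:", np.abs(recon - chunk).max())
```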
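As a rough picture of the KV-cache grafting, here is a minimal PyTorch sketch, not the authors' implementation: a stand-in VLM trunk (VLMStub) encodes the image-language prefix once and caches keys/values at every layer, and a small flow-matching expert (ActionExpert) cross-attends to the matching layer's cache while integrating a noisy action chunk toward a clean one. All module names and sizes below are assumptions.

```python
import torch
import torch.nn as nn

D, LAYERS, ACT_DIM, CHUNK = 256, 4, 7, 8       # toy sizes, all assumptions

class VLMStub(nn.Module):
    """Stand-in for the VLM trunk: returns per-layer (K, V) for the prefix."""
    def __init__(self):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(D, 4, batch_first=True) for _ in range(LAYERS))
        self.kv = nn.ModuleList(nn.Linear(D, 2 * D) for _ in range(LAYERS))

    def forward(self, prefix):                 # prefix: (B, T, D) image+text features
        cache, h = [], prefix
        for blk, kv in zip(self.blocks, self.kv):
            h = blk(h)
            k, v = kv(h).chunk(2, dim=-1)      # per-layer KV-cache entry
            cache.append((k, v))
        return cache

class ActionExpert(nn.Module):
    """Flow-matching head: predicts a velocity for a noisy action chunk while
    cross-attending to the VLM's cached K/V at the matching layer."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(ACT_DIM + 1, D) # +1 channel for the flow time t
        self.cross = nn.ModuleList(
            nn.MultiheadAttention(D, 4, batch_first=True) for _ in range(LAYERS))
        self.mlps = nn.ModuleList(
            nn.Sequential(nn.Linear(D, D), nn.GELU(), nn.Linear(D, D))
            for _ in range(LAYERS))
        self.out = nn.Linear(D, ACT_DIM)

    def forward(self, noisy_actions, t, cache):  # (B, CHUNK, ACT_DIM), (B,)
        tt = t[:, None, None].expand(-1, noisy_actions.shape[1], 1)
        h = self.embed(torch.cat([noisy_actions, tt], dim=-1))
        for attn, mlp, (k, v) in zip(self.cross, self.mlps, cache):
            h = h + attn(h, k, v)[0]           # condition on the cached K/V
            h = h + mlp(h)
        return self.out(h)                     # predicted velocity field

# Inference: run the VLM prefix once, then integrate the flow from noise.
vlm, expert = VLMStub(), ActionExpert()
cache = vlm(torch.randn(1, 32, D))             # one prefix forward per observation
x = torch.randn(1, CHUNK, ACT_DIM)             # start the chunk from Gaussian noise
for step in range(10):                         # 10 Euler steps, an assumption
    t = torch.full((1,), step / 10.0)
    x = x + expert(x, t, cache) / 10.0         # x <- x + v(x, t) * dt
```

The point of this layout is that the expensive VLM prefix forward runs once per observation, while the lightweight expert can take many cheap denoising steps against the frozen cache.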
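The MolmoThink mechanism, re-predicting depth tokens only where the scene changed, can be illustrated with a similar hedged sketch; the patch-difference threshold, token dimension, and depth_head predictor below are stand-ins rather than details from the paper.

```python
import torch
import torch.nn as nn

PATCH, TOK_D = 16, 64                          # patch size / token dim, toy values

depth_head = nn.Sequential(                    # stand-in per-patch depth tokenizer
    nn.Linear(3 * PATCH * PATCH, 128), nn.GELU(), nn.Linear(128, TOK_D))

def patchify(img):                             # img: (3, H, W) -> (N, 3*P*P)
    c, _, _ = img.shape
    p = img.unfold(1, PATCH, PATCH).unfold(2, PATCH, PATCH)
    return p.permute(1, 2, 0, 3, 4).reshape(-1, c * PATCH * PATCH)

@torch.no_grad()
def depth_tokens(img, prev_img=None, cache=None, thresh=0.05):
    patches = patchify(img)
    if prev_img is None or cache is None:      # first frame: predict everything
        return depth_head(patches)
    changed = (patches - patchify(prev_img)).abs().mean(-1) > thresh
    tokens = cache.clone()                     # reuse cached tokens by default
    if changed.any():
        tokens[changed] = depth_head(patches[changed])  # re-predict changed only
    return tokens

# Usage: frame f1 repeats most of f0, so most patches keep their cached token.
f0 = torch.rand(3, 224, 224)
f1 = f0.clone()
f1[:, :32, :32] = torch.rand(3, 32, 32)        # only the top-left region changes
cache = depth_tokens(f0)
tokens = depth_tokens(f1, prev_img=f0, cache=cache)
```

On a mostly static scene, almost every patch reuses its cached token, which is where the claimed latency saving would come from.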