MolmoAct: Action Reasoning Models that can Reason in Space

Abstract

Reasoning is central to purposeful action, yet most robotic foundation models map perception and instructions directly to control, which limits adaptability, generalization, and semantic grounding. We introduce Action Reasoning Models (ARMs), a class of vision-language-action models that integrate perception, planning, and control through a structured three-stage pipeline. Our model, MolmoAct, encodes observations and instructions into depth-aware perception tokens, generates mid-level spatial plans as editable trajectory traces, and predicts precise low-level actions, enabling explainable and steerable behavior. MolmoAct-7B-D achieves strong performance across simulation and real-world settings: 70.5% zero-shot accuracy on SimplerEnv Visual Matching tasks, surpassing closed-source Pi-0 and GR00T N1; 86.6% average success on LIBERO, including an additional 6.3% gain over ThinkAct on long-horizon tasks; and, in real-world fine-tuning, an additional 10% (single-arm) and an additional 22.7% (bimanual) task progression over Pi-0-FAST. It also outperforms baselines by an additional 23.3% on out-of-distribution generalization and achieves the top human-preference scores for open-ended instruction following and trajectory steering. We also release, for the first time, the MolmoAct Dataset, a mid-training robot dataset comprising over 10,000 high-quality robot trajectories across diverse scenarios and tasks. Training with this dataset yields an average 5.5% improvement in general performance over the base model. We release all model weights, training code, our collected dataset, and our action reasoning dataset, establishing MolmoAct as both a state-of-the-art robotics foundation model and an open blueprint for building ARMs that transform perception into purposeful action through structured reasoning. Blog post: https://allenai.org/blog/molmoact
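The three-stage structure described in the abstract (depth-aware perception tokens, then an editable trajectory trace, then low-level actions) can be illustrated with a toy sketch. This is not the MolmoAct implementation or its API: every name below (`perceive`, `plan_trace`, `decode_actions`, `run`, `Plan`) is hypothetical, the "model" is replaced by simple arithmetic, and the stages are not conditioned on one another as they would be in a learned model. The sketch only shows why an explicit intermediate trace makes behavior inspectable and steerable: a user can replace the trace before actions are decoded.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[int, int]  # (x, y) waypoint in image-plane pixels (assumed convention)


@dataclass
class Plan:
    depth_tokens: List[int]             # stage 1 output
    trace: List[Point]                  # stage 2 output, editable by a user
    actions: List[Tuple[float, float]]  # stage 3 output (toy 2D displacements)


def perceive(depth_map: List[List[float]], bins: int = 4) -> List[int]:
    """Toy stage 1: quantize depth values in [0, 1] into discrete tokens."""
    return [min(int(d * bins), bins - 1) for row in depth_map for d in row]


def plan_trace(start: Point, goal: Point, steps: int = 4) -> List[Point]:
    """Toy stage 2: evenly spaced straight-line waypoints from start to goal."""
    return [
        (
            round(start[0] + (goal[0] - start[0]) * i / steps),
            round(start[1] + (goal[1] - start[1]) * i / steps),
        )
        for i in range(steps + 1)
    ]


def decode_actions(trace: List[Point]) -> List[Tuple[float, float]]:
    """Toy stage 3: one action per pair of consecutive waypoints."""
    return [(float(b[0] - a[0]), float(b[1] - a[1])) for a, b in zip(trace, trace[1:])]


def run(
    depth_map: List[List[float]],
    start: Point,
    goal: Point,
    edited_trace: Optional[List[Point]] = None,
) -> Plan:
    """Run the three stages; a user-supplied trace overrides the planned one."""
    tokens = perceive(depth_map)
    trace = edited_trace if edited_trace is not None else plan_trace(start, goal)
    return Plan(tokens, trace, decode_actions(trace))


if __name__ == "__main__":
    depth = [[0.1, 0.9], [0.5, 0.3]]
    default = run(depth, start=(0, 0), goal=(8, 4))
    # Steering: the user redraws the trace to move along y first, then x.
    steered = run(depth, (0, 0), (8, 4), edited_trace=[(0, 0), (0, 4), (8, 4)])
    print(default.trace)
    print(steered.actions)
```

In this sketch the steered run reaches the same goal through different actions, because the actions are derived from the trace rather than directly from the observation.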
