From Passive Observer to Active Critic: Reinforcement Learning Elicits Process Reasoning for Robotic Manipulation

Abstract

Accurate process supervision remains a critical challenge for long-horizon robotic manipulation. A primary bottleneck is that current video MLLMs, trained mainly under a Supervised Fine-Tuning (SFT) paradigm, function as passive "Observers" that recognize ongoing events rather than evaluate the current state relative to the final task goal. In this paper, we introduce PRIMO R1 (Process Reasoning Induced Monitoring), a 7B framework that transforms video MLLMs into active "Critics". We use outcome-based Reinforcement Learning to incentivize explicit Chain-of-Thought generation for progress estimation. In addition, our architecture constructs a structured temporal input by explicitly anchoring the video sequence between initial-state and current-state images. Supported by the proposed PRIMO Dataset and Benchmark, extensive experiments across diverse in-domain environments and out-of-domain real-world humanoid scenarios demonstrate that PRIMO R1 achieves state-of-the-art performance. Quantitatively, our 7B model reduces mean absolute error by 50% relative to specialized reasoning baselines and shows significant relative accuracy improvements over 72B-scale general MLLMs. PRIMO R1 also exhibits strong zero-shot generalization on difficult failure detection tasks: on the RoboFail benchmark it reaches 67.0% accuracy, establishing state-of-the-art performance and surpassing closed-source models such as OpenAI o1 by 6.0%.
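
As a rough illustration of two ideas mentioned above, the following sketch shows one way the anchored temporal input and the mean-absolute-error comparison could be expressed. It is not the PRIMO R1 implementation: the names `AnchoredInput`, `build_anchored_input`, `mean_absolute_error`, and `relative_reduction` are hypothetical, using the first and last frames as the anchor images is an assumption, and so is the 0 to 100 progress scale.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class AnchoredInput:
    """Hypothetical container: a video bracketed by two state images."""
    initial_image: str       # anchor for the initial state
    video_frames: List[str]  # the video sequence in between
    current_image: str       # anchor for the current state


def build_anchored_input(frames: Sequence[str]) -> AnchoredInput:
    # Assumption: the anchors are simply the first and last frames.
    if len(frames) < 2:
        raise ValueError("need at least two frames to form both anchors")
    return AnchoredInput(frames[0], list(frames), frames[-1])


def mean_absolute_error(pred: Sequence[float], target: Sequence[float]) -> float:
    # Average absolute gap between predicted and true progress values.
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def relative_reduction(baseline_mae: float, model_mae: float) -> float:
    # Fraction by which the model lowers the baseline error.
    return (baseline_mae - model_mae) / baseline_mae
```

Under these assumptions, a model whose error is half that of a baseline gives `relative_reduction(20.0, 10.0) == 0.5`, which is the kind of comparison a "50% reduction" refers to.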