From Passive Observer to Active Critic: Reinforcement Learning Elicits Process Reasoning for Robotic Manipulation
March 16, 2026
Authors: Yibin Liu, Yaxing Lyu, Daqi Gao, Zhixuan Liang, Weiliang Tang, Shilong Mu, Xiaokang Yang, Yao Mu
cs.AI
Abstract
Accurate process supervision remains a critical challenge for long-horizon robotic manipulation. A primary bottleneck is that current video MLLMs, trained mainly under the Supervised Fine-Tuning (SFT) paradigm, act as passive "Observers" that recognize ongoing events rather than evaluate the current state against the final task goal. In this paper, we introduce PRIMO R1 (Process Reasoning Induced Monitoring), a 7B framework that transforms video MLLMs into active "Critics". We leverage outcome-based Reinforcement Learning to incentivize explicit Chain-of-Thought generation for progress estimation. Furthermore, our architecture constructs a structured temporal input by explicitly anchoring the video sequence between initial-state and current-state images. Supported by the proposed PRIMO Dataset and Benchmark, extensive experiments across diverse in-domain environments and out-of-domain real-world humanoid scenarios demonstrate that PRIMO R1 achieves state-of-the-art performance. Quantitatively, our 7B model reduces the mean absolute error of specialized reasoning baselines by 50% and achieves significantly higher relative accuracy than 72B-scale general-purpose MLLMs. PRIMO R1 also exhibits strong zero-shot generalization on difficult failure detection tasks, setting a new state of the art on the RoboFail benchmark with 67.0% accuracy and surpassing closed-source models such as OpenAI o1 by 6.0 points.
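To make the two mechanisms the abstract names concrete, here is a minimal Python sketch (not the authors' released code): the structured temporal input that anchors a subsampled frame sequence between the initial-state and current-state images, and an outcome-based reward that scores only the final progress estimate. The message layout, `<think>`/`<answer>` tag format, frame count, and reward scale are all assumptions for illustration.

```python
# Minimal sketch of the abstract's two ideas; names and formats are assumed.
import re
from typing import Any, Dict, List

def build_structured_input(initial_img: Any, video_frames: List[Any],
                           current_img: Any, instruction: str,
                           num_samples: int = 8) -> Dict[str, Any]:
    """Anchor a uniformly subsampled video sequence between the initial-state
    and current-state images, forming a structured temporal input."""
    step = max(1, len(video_frames) // num_samples)
    sampled = video_frames[::step][:num_samples]
    prompt = (
        f"Task: {instruction}\n"
        "The first image is the initial state and the last image is the "
        "current state. Reason step by step inside <think></think>, then "
        "give the task progress (0-100) inside <answer></answer>."
    )
    return {"images": [initial_img, *sampled, current_img], "prompt": prompt}

def outcome_reward(response: str, gt_progress: float) -> float:
    """Outcome-based reward: only the final answer is scored, so RL is free
    to incentivize whatever chain-of-thought improves that answer."""
    match = re.search(r"<answer>\s*([\d.]+)\s*</answer>", response)
    if match is None:
        return -1.0  # assumed penalty when no parsable answer is produced
    try:
        pred = float(match.group(1))
    except ValueError:
        return -1.0  # malformed number inside the answer tags
    return 1.0 - abs(pred - gt_progress) / 100.0  # closer estimate, higher reward
```

In a typical outcome-based RL loop (e.g., GRPO-style group sampling), several responses would be sampled per input and rewards like this used to compute relative advantages; the abstract does not specify which RL algorithm PRIMO R1 actually uses.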