
From Passive Observer to Active Critic: Reinforcement Learning Elicits Process Reasoning for Robotic Manipulation

March 16, 2026
作者: Yibin Liu, Yaxing Lyu, Daqi Gao, Zhixuan Liang, Weiliang Tang, Shilong Mu, Xiaokang Yang, Yao Mu
cs.AI

Abstract

Accurate process supervision remains a critical challenge for long-horizon robotic manipulation. A primary bottleneck is that current video MLLMs, trained primarily under a Supervised Fine-Tuning (SFT) paradigm, function as passive "Observers" that recognize ongoing events rather than evaluating the current state relative to the final task goal. In this paper, we introduce PRIMO R1 (Process Reasoning Induced Monitoring), a 7B framework that transforms video MLLMs into active "Critics". We leverage outcome-based Reinforcement Learning to incentivize explicit Chain-of-Thought generation for progress estimation. Furthermore, our architecture constructs a structured temporal input by explicitly anchoring the video sequence between initial- and current-state images. Supported by the proposed PRIMO Dataset and Benchmark, extensive experiments across diverse in-domain environments and out-of-domain real-world humanoid scenarios demonstrate that PRIMO R1 achieves state-of-the-art performance. Quantitatively, our 7B model reduces the mean absolute error of specialized reasoning baselines by 50% and delivers significant accuracy gains over 72B-scale general MLLMs. PRIMO R1 also exhibits strong zero-shot generalization on difficult failure-detection tasks, establishing state-of-the-art performance on the RoboFail benchmark with 67.0% accuracy and surpassing closed-source models such as OpenAI o1 by 6.0%.
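The structured temporal input described in the abstract — a video clip explicitly anchored between the initial-state and current-state images — can be sketched as below. This is a minimal illustration, not the authors' released implementation; the function name, uniform frame subsampling, and `num_samples` parameter are assumptions for exposition.

```python
def build_temporal_input(initial_frame, video_frames, current_frame, num_samples=8):
    """Hypothetical sketch of PRIMO R1-style input construction:
    anchor a (subsampled) video sequence between the initial and
    current state images so the model can compare progress against
    both the starting point and the present observation."""
    if num_samples > 0 and len(video_frames) > num_samples:
        # Uniform temporal subsampling of the intermediate frames.
        step = len(video_frames) / num_samples
        video_frames = [video_frames[int(i * step)] for i in range(num_samples)]
    # Explicit anchoring: initial state -> clip -> current state.
    return [initial_frame, *video_frames, current_frame]
```

The ordering matters: placing the initial and current states at fixed positions gives the model stable reference anchors for estimating how far the task has progressed.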