Robo-Dopamine: General Process Reward Modeling for High-Precision Robotic Manipulation
December 29, 2025
Authors: Huajie Tan, Sixiang Chen, Yijie Xu, Zixiao Wang, Yuheng Ji, Cheng Chi, Yaoxu Lyu, Zhongxia Zhao, Xiansheng Chen, Peterson Co, Shaoxuan Xie, Guocai Yao, Pengwei Wang, Zhongyuan Wang, Shanghang Zhang
cs.AI
Abstract
The primary obstacle to applying reinforcement learning (RL) to real-world robotics is the design of effective reward functions. While learning-based Process Reward Models (PRMs) have recently emerged as a promising direction, they are often hindered by two fundamental limitations: their reward models lack step-aware understanding and rely on single-view perception, leading to unreliable assessments of fine-grained manipulation progress; and their reward shaping procedures are theoretically unsound, often inducing a semantic trap that misguides policy optimization. To address these limitations, we introduce Dopamine-Reward, a novel reward modeling method for learning a general-purpose, step-aware process reward model from multi-view inputs. At its core is our General Reward Model (GRM), trained on a vast 3,400+ hour dataset, which leverages Step-wise Reward Discretization for structural understanding and Multi-Perspective Reward Fusion to overcome perceptual limitations. Building upon Dopamine-Reward, we propose Dopamine-RL, a robust policy learning framework that employs a theoretically sound Policy-Invariant Reward Shaping method, which enables the agent to leverage dense rewards for efficient self-improvement without altering the optimal policy, thereby fundamentally avoiding the semantic trap. Extensive experiments across diverse simulated and real-world tasks validate our approach. GRM achieves state-of-the-art accuracy in reward assessment, and Dopamine-RL built on GRM significantly improves policy learning efficiency. For instance, after GRM is adapted to a new task in a one-shot manner from a single expert trajectory, the resulting reward model enables Dopamine-RL to improve the policy's success rate from near zero to 95% with only 150 online rollouts (approximately 1 hour of real robot interaction), while retaining strong generalization across tasks. Project website: https://robo-dopamine.github.io
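The abstract does not spell out the Policy-Invariant Reward Shaping construction, but the property it claims, dense rewards that do not alter the optimal policy, is exactly what classical potential-based reward shaping (Ng, Harada, and Russell, 1999) guarantees. The sketch below is a minimal illustration of that standard construction under stated assumptions, not the paper's implementation; `grm_progress` is a hypothetical stand-in for a step-aware progress score such as one a model like GRM might produce.

```python
# Minimal sketch of potential-based reward shaping (Ng et al., 1999).
# Assumption: the process reward model exposes a scalar progress score
# grm_progress(obs) in [0, 1]; this is a hypothetical interface, not
# the paper's actual GRM API.

GAMMA = 0.99  # discount factor of the underlying MDP


def grm_progress(obs) -> float:
    """Placeholder for a learned, multi-view progress estimate in [0, 1]."""
    raise NotImplementedError


def shaped_reward(sparse_reward: float, obs, next_obs) -> float:
    """Dense reward that provably preserves the optimal policy.

    Using the progress estimate as a potential Phi, the shaping term
    F(s, s') = gamma * Phi(s') - Phi(s) telescopes along trajectories,
    so the set of optimal policies of the shaped MDP matches that of
    the original sparse-reward MDP.
    """
    phi_s = grm_progress(obs)
    phi_s_next = grm_progress(next_obs)
    return sparse_reward + GAMMA * phi_s_next - phi_s
```

Because the shaping term telescopes, the discounted sum of shaping bonuses along any trajectory depends only on its start and end potentials, so the ranking of policies by expected return is unchanged; this is one formal route to the policy-invariance property the abstract describes.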