Robo-Dopamine: General Process Reward Modeling for High-Precision Robotic Manipulation
December 29, 2025
Authors: Huajie Tan, Sixiang Chen, Yijie Xu, Zixiao Wang, Yuheng Ji, Cheng Chi, Yaoxu Lyu, Zhongxia Zhao, Xiansheng Chen, Peterson Co, Shaoxuan Xie, Guocai Yao, Pengwei Wang, Zhongyuan Wang, Shanghang Zhang
cs.AI
Abstract
The primary obstacle to applying reinforcement learning (RL) to real-world robotics is the design of effective reward functions. While recently proposed learning-based Process Reward Models (PRMs) are a promising direction, they are often hindered by two fundamental limitations: their reward models lack step-aware understanding and rely on single-view perception, leading to unreliable assessments of fine-grained manipulation progress; and their reward shaping procedures are theoretically unsound, often inducing a semantic trap that misguides policy optimization. To address these limitations, we introduce Dopamine-Reward, a novel reward modeling method for learning a general-purpose, step-aware process reward model from multi-view inputs. At its core is our General Reward Model (GRM), trained on a dataset of over 3,400 hours, which leverages Step-wise Reward Discretization for structured understanding and Multi-Perspective Reward Fusion to overcome perceptual limitations. Building upon Dopamine-Reward, we propose Dopamine-RL, a robust policy learning framework that employs a theoretically sound Policy-Invariant Reward Shaping method, which enables the agent to leverage dense rewards for efficient self-improvement without altering the optimal policy, thereby fundamentally avoiding the semantic trap. Extensive experiments across diverse simulated and real-world tasks validate our approach: GRM achieves state-of-the-art accuracy in reward assessment, and Dopamine-RL built on GRM significantly improves policy learning efficiency. For instance, after GRM is adapted to a new task in a one-shot manner from a single expert trajectory, the resulting reward model enables Dopamine-RL to improve the policy from near-zero to 95% success with only 150 online rollouts (approximately 1 hour of real robot interaction), while retaining strong generalization across tasks. Project website: https://robo-dopamine.github.io
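For context on the policy-invariance claim: the classical guarantee behind policy-invariant reward shaping is potential-based shaping (Ng, Harada, and Russell, 1999). The sketch below assumes a potential function Φ defined over states; whether Dopamine-RL instantiates Φ with GRM progress estimates is our assumption and is not stated in the abstract.

$$
\tilde{r}(s, a, s') \;=\; r(s, a, s') \;+\; \underbrace{\gamma\,\Phi(s') - \Phi(s)}_{F(s, a, s')}, \qquad \Phi : \mathcal{S} \to \mathbb{R}.
$$

Because the shaped optimal action-value function satisfies $\tilde{Q}^*(s, a) = Q^*(s, a) - \Phi(s)$, the greedy action $\arg\max_a \tilde{Q}^*(s, a) = \arg\max_a Q^*(s, a)$ is unchanged for every state. Dense per-step feedback (e.g., a learned progress score used as Φ) can therefore accelerate learning without introducing the semantic trap of directly rewarding intermediate states.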