Robo-Dopamine: General Process Reward Modeling for High-Precision Robotic Manipulation
December 29, 2025
Authors: Huajie Tan, Sixiang Chen, Yijie Xu, Zixiao Wang, Yuheng Ji, Cheng Chi, Yaoxu Lyu, Zhongxia Zhao, Xiansheng Chen, Peterson Co, Shaoxuan Xie, Guocai Yao, Pengwei Wang, Zhongyuan Wang, Shanghang Zhang
cs.AI
Abstract
The primary obstacle to applying reinforcement learning (RL) to real-world robotics is the design of effective reward functions. While recently proposed learning-based Process Reward Models (PRMs) are a promising direction, they are often hindered by two fundamental limitations: their reward models lack step-aware understanding and rely on single-view perception, leading to unreliable assessments of fine-grained manipulation progress; and their reward shaping procedures are theoretically unsound, often inducing a semantic trap that misguides policy optimization. To address these limitations, we introduce Dopamine-Reward, a novel reward modeling method for learning a general-purpose, step-aware process reward model from multi-view inputs. At its core is our General Reward Model (GRM), trained on a dataset of over 3,400 hours, which leverages Step-wise Reward Discretization for structured understanding and Multi-Perspective Reward Fusion to overcome perceptual limitations. Building upon Dopamine-Reward, we propose Dopamine-RL, a robust policy learning framework that employs a theoretically sound Policy-Invariant Reward Shaping method, which enables the agent to leverage dense rewards for efficient self-improvement without altering the optimal policy, thereby fundamentally avoiding the semantic trap. Extensive experiments across diverse simulated and real-world tasks validate our approach: GRM achieves state-of-the-art accuracy in reward assessment, and Dopamine-RL built on GRM significantly improves policy learning efficiency. For instance, after GRM is adapted to a new task in a one-shot manner from a single expert trajectory, the resulting reward model enables Dopamine-RL to improve the policy from near-zero to 95% success with only 150 online rollouts (approximately 1 hour of real robot interaction), while retaining strong generalization across tasks. Project website: https://robo-dopamine.github.io
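For context on the policy-invariance claim: the classical guarantee behind policy-invariant reward shaping is potential-based shaping (Ng, Harada, and Russell, 1999). The sketch below assumes a potential function Φ defined over states; whether Dopamine-RL instantiates Φ with GRM progress estimates is our assumption and is not stated in the abstract.

$$
\tilde{r}(s, a, s') \;=\; r(s, a, s') \;+\; \underbrace{\gamma\,\Phi(s') - \Phi(s)}_{F(s, a, s')}, \qquad \Phi : \mathcal{S} \to \mathbb{R}.
$$

Because the shaped optimal action-value function satisfies $\tilde{Q}^*(s, a) = Q^*(s, a) - \Phi(s)$, the greedy action $\arg\max_a \tilde{Q}^*(s, a) = \arg\max_a Q^*(s, a)$ is unchanged for every state. Dense per-step feedback (e.g., a learned progress score used as Φ) can therefore accelerate learning without introducing the semantic trap of directly rewarding intermediate states.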