

GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

January 8, 2026
作者: Shih-Yang Liu, Xin Dong, Ximing Lu, Shizhe Diao, Peter Belcak, Mingjie Liu, Min-Hung Chen, Hongxu Yin, Yu-Chiang Frank Wang, Kwang-Ting Cheng, Yejin Choi, Jan Kautz, Pavlo Molchanov
cs.AI

Abstract

As language models become increasingly capable, users expect them to provide not only accurate responses but also behaviors aligned with diverse human preferences across a variety of scenarios. To achieve this, reinforcement learning (RL) pipelines have begun incorporating multiple rewards, each capturing a distinct preference, to guide models toward these desired behaviors. However, recent work has defaulted to applying Group Relative Policy Optimization (GRPO) in the multi-reward setting without examining its suitability. In this paper, we demonstrate that directly applying GRPO to normalize distinct rollout reward combinations causes them to collapse into identical advantage values, reducing the resolution of the training signal and resulting in suboptimal convergence and, in some cases, early training failure. We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method that resolves these issues by decoupling the normalization of individual rewards, more faithfully preserving their relative differences and enabling more accurate multi-reward optimization along with substantially improved training stability. We compare GDPO with GRPO across three tasks: tool calling, math reasoning, and code reasoning, evaluating both correctness metrics (accuracy, bug ratio) and constraint adherence metrics (format, length). Across all settings, GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.
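The abstract does not give the paper's exact formulation, but the collapse it describes can be illustrated with a small toy sketch. The reward values and the `group_normalize` helper below are made up for illustration, and the GDPO line only follows the abstract's high-level description of decoupling per-reward normalization, not necessarily the paper's actual algorithm:

```python
import numpy as np

def group_normalize(r, eps=1e-8):
    """Group-relative normalization: (r - mean) / std over the rollout group."""
    return (r - r.mean()) / (r.std() + eps)

# Hypothetical rewards for a group of 4 rollouts (A, B, C, D) scored on two criteria:
# A and C pass correctness only, B passes formatting only, D fails both.
correctness = np.array([1.0, 0.0, 1.0, 0.0])
formatting  = np.array([0.0, 1.0, 0.0, 0.0])

# GRPO-style: sum the rewards first, then normalize the combined score.
# A, B, and C share the same total reward (1.0), so their distinct reward
# combinations collapse to the same advantage value.
grpo_adv = group_normalize(correctness + formatting)

# GDPO-style (per the abstract's description): normalize each reward within the
# group separately, then combine. B's rarer formatting success now receives a
# different advantage than A's and C's correctness success.
gdpo_adv = group_normalize(correctness) + group_normalize(formatting)

print("GRPO advantages:", np.round(grpo_adv, 3))   # [ 0.577  0.577  0.577 -1.732]
print("GDPO advantages:", np.round(gdpo_adv, 3))   # [ 0.423  0.732  0.423 -1.577]
```

In this toy group, summing before normalizing assigns the same advantage to three different reward combinations, whereas decoupled per-reward normalization keeps their training signals distinct.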