

GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

January 8, 2026
作者: Shih-Yang Liu, Xin Dong, Ximing Lu, Shizhe Diao, Peter Belcak, Mingjie Liu, Min-Hung Chen, Hongxu Yin, Yu-Chiang Frank Wang, Kwang-Ting Cheng, Yejin Choi, Jan Kautz, Pavlo Molchanov
cs.AI

Abstract

As language models become increasingly capable, users expect them to provide not only accurate responses but also behaviors aligned with diverse human preferences across a variety of scenarios. To achieve this, reinforcement learning (RL) pipelines have begun incorporating multiple rewards, each capturing a distinct preference, to guide models toward these desired behaviors. However, recent work has defaulted to applying Group Relative Policy Optimization (GRPO) in the multi-reward setting without examining its suitability. In this paper, we demonstrate that directly applying GRPO to normalize distinct rollout reward combinations causes them to collapse into identical advantage values, reducing the resolution of the training signal and resulting in suboptimal convergence and, in some cases, early training failure. We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method that resolves these issues by decoupling the normalization of individual rewards, more faithfully preserving their relative differences and enabling more accurate multi-reward optimization along with substantially improved training stability. We compare GDPO with GRPO across three tasks: tool calling, math reasoning, and coding reasoning, evaluating both correctness metrics (accuracy, bug ratio) and constraint-adherence metrics (format, length). Across all settings, GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.
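
The collapse described above can be illustrated with a small numerical sketch. The snippet below is not the authors' implementation; the per-reward z-score normalization, the equal-weight combination, and the function names grpo_advantages and gdpo_advantages are assumptions made for illustration based only on the abstract. It contrasts normalizing the summed reward within a rollout group (GRPO-style) with normalizing each reward component separately before combining (GDPO-style, as the abstract describes).

```python
# Minimal sketch (not the paper's implementation): group-level normalization of
# a summed reward vs. per-reward decoupled normalization. The z-score scheme,
# equal-weight combination, and function names are illustrative assumptions.
import numpy as np

EPS = 1e-8  # numerical guard for zero-variance groups


def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Sum the reward components per rollout, then normalize the sums within
    the group: rollouts with different reward profiles but equal totals
    receive identical advantages."""
    totals = rewards.sum(axis=1)                            # shape: (num_rollouts,)
    return (totals - totals.mean()) / (totals.std() + EPS)


def gdpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Normalize each reward component across the group separately, then
    combine, preserving relative differences between components."""
    per_reward = (rewards - rewards.mean(axis=0)) / (rewards.std(axis=0) + EPS)
    return per_reward.sum(axis=1)                           # equal weights assumed


# Toy group of 4 rollouts with 2 rewards, e.g. (accuracy, format adherence).
rewards = np.array([
    [1.0, 0.0],   # accurate but violates the format constraint
    [0.0, 1.0],   # inaccurate but follows the format
    [1.0, 1.0],   # satisfies both
    [0.0, 1.0],   # inaccurate, follows the format
])

print("GRPO:", grpo_advantages(rewards))  # rollouts 0, 1, 3 collapse to one value
print("GDPO:", gdpo_advantages(rewards))  # rollout 0 is separated from 1 and 3
```

In this toy group, the summed-reward normalization assigns the same advantage to rollouts 0, 1, and 3 because their totals are equal, whereas the decoupled normalization distinguishes the rollout that sacrifices format from those that sacrifice accuracy, which is the loss of training-signal resolution the abstract refers to.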