GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

Abstract

As language models become increasingly capable, users expect them to provide not only accurate responses but also behaviors aligned with diverse human preferences across a variety of scenarios. To achieve this, reinforcement learning (RL) pipelines have begun incorporating multiple rewards, each capturing a distinct preference, to guide models toward these desired behaviors. However, recent work has defaulted to applying Group Relative Policy Optimization (GRPO) in multi-reward settings without examining its suitability. In this paper, we demonstrate that directly applying GRPO to normalize distinct rollout reward combinations causes them to collapse into identical advantage values, reducing the resolution of the training signal and resulting in suboptimal convergence and, in some cases, early training failure. We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method that resolves these issues by decoupling the normalization of individual rewards. This more faithfully preserves the rewards' relative differences, enabling more accurate multi-reward optimization and substantially improved training stability. We compare GDPO with GRPO across three tasks (tool calling, math reasoning, and coding reasoning), evaluating both correctness metrics (accuracy, bug ratio) and constraint adherence metrics (format, length). Across all settings, GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.
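
The collapse described above can be illustrated with a small sketch. It is not the paper's implementation: the abstract gives no formulas, so the code assumes that the GRPO baseline sums the rewards of each rollout and standardizes the totals within a group, and that the decoupled variant standardizes each reward separately within the group before summing. The function names, the EPS constant, and the toy reward values are all hypothetical.

```python
from statistics import fmean, pstdev

EPS = 1e-6  # hypothetical stabiliser that avoids division by zero


def group_normalize(values):
    """Standardise a list of scalars by its own mean and population std."""
    mean = fmean(values)
    std = pstdev(values)
    return [(v - mean) / (std + EPS) for v in values]


def summed_then_normalized(rewards):
    """Assumed GRPO-style baseline: sum rewards per rollout, normalise once.

    rewards[i][k] is reward k of rollout i within one group.
    """
    totals = [sum(r) for r in rewards]
    return group_normalize(totals)


def decoupled_then_summed(rewards):
    """Assumed decoupled variant: normalise each reward across the group,
    then sum the normalised values per rollout."""
    columns = [list(col) for col in zip(*rewards)]
    normalized_columns = [group_normalize(col) for col in columns]
    return [sum(vals) for vals in zip(*normalized_columns)]


# Two toy groups of two rollouts, each scored by two binary rewards
# (for example correctness and format; values are made up).
group_a = [(0, 0), (1, 0)]  # second rollout satisfies one reward
group_b = [(0, 0), (1, 1)]  # second rollout satisfies both rewards

grpo_a = summed_then_normalized(group_a)
grpo_b = summed_then_normalized(group_b)
dec_a = decoupled_then_summed(group_a)
dec_b = decoupled_then_summed(group_b)

print("summed-then-normalized:", grpo_a, grpo_b)
print("decoupled-then-summed: ", dec_a, dec_b)
```

Under these assumptions, the summed-then-normalized advantages of the two groups are practically identical even though the better rollout in the second group satisfies twice as many rewards. The decoupled version gives that rollout roughly twice the advantage, which is the kind of relative difference the abstract says GDPO preserves.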
