GD^2PO: Mitigating Multi-Reward Conflicts via Group-Dynamic Reward-Decoupled Policy Optimization

Abstract

As LLMs advance, post-training reinforcement learning (RL) increasingly relies on multi-dimensional rewards to cultivate comprehensive capabilities. This shift demands new algorithms that can optimize diverse and potentially competing objectives simultaneously. Existing methods such as Group reward-Decoupled Policy Optimization (GDPO) address this by decomposing the overall score into independent reward groups and computing the RL loss separately within each group. However, this strategy still suffers from multi-reward conflicts: a single rollout can yield positive advantages on some reward dimensions but negative advantages on others, so the opposing signals cancel out during aggregation and hinder RL training efficiency. Inspired by Dynamic sAmpling Policy Optimization (DAPO), which improves RL training efficiency by filtering out ineffective rollouts with near-zero advantages, we propose Group-Dynamic reward-Decoupled Policy Optimization (GD^2PO). Specifically, GD^2PO employs a conflict-aware filtering mechanism that masks out rollouts with severe reward-wise disagreement. By preventing conflicting signals from canceling each other out, this masking preserves and enhances the magnitude of the effective RL advantages, thereby significantly accelerating learning. Furthermore, we introduce query-level reweighting, which dynamically adjusts the update intensity of each query based on its overall reward consensus. Experiments on various multi-reward scenarios, including tool calling and human preference alignment, demonstrate that GD^2PO consistently and significantly outperforms existing baselines. The code is available at https://github.com/Qwen-Applications/GD2PO.
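The cancellation effect and the masking idea described above can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the conflict measure, the masking threshold, or the reweighting formula, so `conflict_score`, `conflict_threshold`, and the "fraction of rollouts kept" query weight below are hypothetical choices made only for illustration. The per-dimension normalization assumes a standard group-wise mean and standard deviation within one query's rollouts.

```python
from statistics import mean, pstdev


def decoupled_advantages(rewards):
    """rewards[i][d] is reward dimension d of rollout i for one query.

    Each dimension is normalized separately across the group
    (assumed form of reward-decoupled advantages).
    """
    n, dims = len(rewards), len(rewards[0])
    adv = [[0.0] * dims for _ in range(n)]
    for d in range(dims):
        column = [row[d] for row in rewards]
        mu, sigma = mean(column), pstdev(column)
        for i in range(n):
            adv[i][d] = (column[i] - mu) / (sigma + 1e-6)
    return adv


def conflict_score(adv_row):
    """Hypothetical disagreement measure in [0, 1].

    0.0 when all dimensions agree in sign, 1.0 when the positive and
    negative advantage mass balance exactly.
    """
    pos = sum(a for a in adv_row if a > 0)
    neg = -sum(a for a in adv_row if a < 0)
    total = pos + neg
    if total == 0:
        return 0.0
    return 1.0 - abs(pos - neg) / total


def filter_rollouts(rewards, conflict_threshold=0.8):
    """Mask rollouts whose per-dimension advantages disagree strongly.

    Returns (aggregated advantages, keep mask, query weight). The query
    weight is an assumed proxy for "reward consensus": the fraction of
    rollouts that survive the mask.
    """
    adv = decoupled_advantages(rewards)
    keep = [conflict_score(row) < conflict_threshold for row in adv]
    aggregated = [sum(row) if k else 0.0 for row, k in zip(adv, keep)]
    query_weight = sum(keep) / len(keep)
    return aggregated, keep, query_weight


# Four rollouts, two reward dimensions. Rollouts 2 and 3 score well on
# one dimension and badly on the other, so their summed advantage cancels.
rewards = [[1.0, 1.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
aggregated, keep, query_weight = filter_rollouts(rewards)
print(keep)          # [True, True, False, False]
print(query_weight)  # 0.5
```

In this toy group, the two conflicting rollouts would contribute a summed advantage of zero anyway. Masking them explicitly means a loss averaged over the kept rollouts is no longer diluted by them, which is one plausible reading of "preserving the magnitude of effective advantages".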