GD^2PO: Mitigating Multi-Reward Conflicts via Group-Dynamic Reward-Decoupled Policy Optimization

Abstract

As large language models (LLMs) advance, post-training reinforcement learning (RL) increasingly relies on multi-dimensional rewards to cultivate comprehensive capabilities. This shift demands new algorithms that can optimize diverse and potentially competing objectives simultaneously. Existing methods such as Group reward-Decoupled Policy Optimization (GDPO) address this by decomposing the overall score into independent reward groups and computing the RL loss separately within each group. However, this strategy still suffers from multi-reward conflicts: a single rollout can receive positive advantages on some reward dimensions and negative advantages on others, so the opposing signals cancel out during aggregation and reduce RL training efficiency. Inspired by Dynamic sAmpling Policy Optimization (DAPO), which improves RL training efficiency by filtering out ineffective rollouts with near-zero advantages, we propose Group-Dynamic reward-Decoupled Policy Optimization (GD^2PO). Specifically, GD^2PO employs a conflict-aware filtering mechanism that masks out rollouts with severe disagreement across reward dimensions. By preventing conflicting signals from canceling each other out, this masking strategy preserves and enhances the magnitude of effective RL advantages, thereby accelerating learning. Furthermore, we introduce query-level reweighting, which dynamically adjusts the update intensity of each query according to its overall reward consensus. Experiments on various multi-reward scenarios, including tool calling and human preference alignment, show that GD^2PO consistently and significantly outperforms existing baselines. The code is available at https://github.com/Qwen-Applications/GD2PO.
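The abstract gives no formulas, so the following is only an illustrative sketch of how conflict-aware masking and query-level reweighting could fit together for the rollouts of a single query. Everything specific here is an assumption rather than the authors' method: the function names, the `conflict_score` definition (the fraction of advantage magnitude lost to cancellation), the `conflict_threshold` value, and the choice of `1 - mean conflict` as the query weight are all hypothetical. The released code at the URL above is the authoritative reference.

```python
from statistics import mean, pstdev


def group_normalize(rewards):
    """Normalize one reward dimension across the rollouts of a single query."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0.0:
        # All rollouts scored the same on this dimension: no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def conflict_score(advs):
    """Hypothetical conflict measure in [0, 1].

    0 means all per-reward advantages share a sign; 1 means they cancel
    completely when summed.
    """
    total = sum(abs(a) for a in advs)
    if total == 0.0:
        return 0.0
    return 1.0 - abs(sum(advs)) / total


def masked_advantages(reward_matrix, conflict_threshold=0.5):
    """reward_matrix[i][d] is reward dimension d for rollout i of one query.

    Returns (advantages, keep_mask, query_weight). The threshold and the
    weighting rule are assumptions made for illustration only.
    """
    n = len(reward_matrix)
    k = len(reward_matrix[0])
    # Decoupled step: normalize each reward dimension separately.
    per_dim = [group_normalize([row[d] for row in reward_matrix]) for d in range(k)]
    per_rollout = [[per_dim[d][i] for d in range(k)] for i in range(n)]
    # Conflict-aware filtering: drop rollouts whose signals mostly cancel.
    scores = [conflict_score(a) for a in per_rollout]
    keep = [s <= conflict_threshold for s in scores]
    # Query-level reweighting: higher consensus gives a stronger update.
    query_weight = 1.0 - mean(scores)
    advantages = [
        query_weight * sum(a) if m else 0.0 for a, m in zip(per_rollout, keep)
    ]
    return advantages, keep, query_weight


# Four rollouts, two reward dimensions; the last two rollouts conflict.
example = [[1, 1], [0, 0], [1, 0], [0, 1]]
adv, keep, weight = masked_advantages(example)
print(adv)     # [1.0, -1.0, 0.0, 0.0]
print(keep)    # [True, True, False, False]
print(weight)  # 0.5
```

In this toy case the two rollouts whose normalized advantages are `+1` on one dimension and `-1` on the other are masked out, rather than contributing a summed advantage of zero, and the query's update is scaled down because half of its rollouts disagree across dimensions.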