GD^2PO: Mitigating Multi-Reward Conflicts via Group-Dynamic reward-Decoupled Policy Optimization
June 15, 2026
Authors: Haotian Liu, Yihao Liu, Jingwei Ni, Siyuan Huang, Xinpeng Liu, Pengyu Cheng, Jiajun Song, Ruijin Ding, Junfeng Li, Zhechao Yu, Mengyu Zhou, Hongteng Xu, Xiaoxi Jiang, Guanjun Jiang
cs.AI
Abstract
As LLMs advance, post-training reinforcement learning (RL) increasingly relies on multi-dimensional rewards to cultivate comprehensive capabilities. This shift calls for algorithms that can optimize diverse and potentially competing objectives simultaneously. Existing methods such as Group reward-Decoupled Policy Optimization (GDPO) decompose the overall score into independent reward groups and compute the RL loss separately within each group. However, this strategy still suffers from multi-reward conflicts: a single rollout can yield positive advantages on some reward dimensions and negative advantages on others, so the opposing signals cancel out during aggregation and RL training efficiency drops. Inspired by Dynamic sAmpling Policy Optimization (DAPO), which improves RL training efficiency by filtering out ineffective rollouts with near-zero advantages, we propose Group-Dynamic reward-Decoupled Policy Optimization (GD^2PO). Specifically, GD^2PO employs a conflict-aware filtering mechanism that masks out rollouts with severe reward-wise disagreement. By preventing conflicting signals from canceling each other out, this masking preserves and enhances the magnitude of effective RL advantages, thereby significantly accelerating learning. Furthermore, we introduce query-level reweighting, which dynamically adjusts the update intensity of each query based on its overall reward consensus. Experiments on various multi-reward scenarios, including tool calling and human preference alignment, demonstrate that GD^2PO consistently and significantly outperforms existing baselines. The code is available at https://github.com/Qwen-Applications/GD2PO.
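The cancellation effect described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names, the plain-sum aggregation, and the sign-based conflict criterion are all assumptions made for illustration, since the abstract does not specify the exact filtering rule or the reweighting formula. Each reward dimension is normalized separately across the rollouts of one query, and a rollout is flagged as conflicting when its per-reward advantages have opposite signs.

```python
from statistics import mean, pstdev


def group_normalize(values, eps=1e-8):
    """Normalize one reward dimension across the rollouts of a single query."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def decoupled_advantages(rewards):
    """rewards[i][k] is reward k of rollout i; returns advantages of the same shape."""
    num_rewards = len(rewards[0])
    columns = [group_normalize([r[k] for r in rewards]) for k in range(num_rewards)]
    return [[columns[k][i] for k in range(num_rewards)] for i in range(len(rewards))]


def conflict_mask(advantages, tol=1e-6):
    """Hypothetical criterion (an assumption, not the paper's rule):
    keep a rollout only if its non-negligible advantages share one sign."""
    mask = []
    for adv in advantages:
        has_pos = any(a > tol for a in adv)
        has_neg = any(a < -tol for a in adv)
        mask.append(not (has_pos and has_neg))
    return mask


# Four rollouts of one query, scored on two reward dimensions (toy values).
rewards = [
    [1.0, 0.0],  # good on reward 0, bad on reward 1
    [0.0, 1.0],  # bad on reward 0, good on reward 1
    [1.0, 1.0],  # good on both
    [0.0, 0.0],  # bad on both
]
advantages = decoupled_advantages(rewards)
summed = [sum(a) for a in advantages]  # assumed aggregation: plain sum
mask = conflict_mask(advantages)
```

In this toy group, the first two rollouts have per-reward advantages of opposite sign and equal magnitude, so their summed advantage is zero and they contribute no learning signal; the sketch masks them and keeps the two rollouts whose rewards agree.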