GD^2PO: Mitigating Multi-Reward Conflicts via Group-Dynamic reward-Decoupled Policy Optimization

Abstract

As large language models (LLMs) advance, post-training reinforcement learning (RL) increasingly relies on multi-dimensional rewards to cultivate comprehensive capabilities. This shift demands algorithms that can optimize diverse and potentially competing objectives simultaneously. Existing methods such as Group reward-Decoupled Policy Optimization (GDPO) address this by decomposing the overall score into independent reward groups and computing the RL loss separately within each group. However, this strategy still suffers from multi-reward conflicts: a single rollout can receive positive advantages on some reward dimensions and negative advantages on others, so the opposing signals cancel out during aggregation and hinder RL training efficiency. Inspired by Dynamic sAmpling Policy Optimization (DAPO), which improves RL training efficiency by filtering out ineffective rollouts with near-zero advantages, we propose Group-Dynamic reward-Decoupled Policy Optimization (GD^2PO). Specifically, GD^2PO uses a conflict-aware filtering mechanism to mask out rollouts with severe disagreement across rewards. By preventing conflicting signals from canceling out, this masking preserves and enhances the magnitude of the effective RL advantages, thereby significantly accelerating learning. Furthermore, we introduce query-level reweighting, which dynamically adjusts the update intensity of each query based on its overall reward consensus. Experiments on a range of multi-reward scenarios, including tool calling and human preference alignment, show that GD^2PO consistently and significantly outperforms existing baselines. The code is available at https://github.com/Qwen-Applications/GD2PO.
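The abstract names three ingredients (per-reward decoupled advantages, conflict-aware masking of rollouts, and query-level reweighting) without giving their formulas. The sketch below is one possible reading, not the authors' implementation. It makes the following assumptions: each reward dimension is normalized by its own group mean and standard deviation; conflict is measured by a hypothetical agreement score |sum of advantages| / sum of |advantages|; rollouts below an illustrative threshold are masked; and the query weight is the mean agreement score. All function names and the threshold value are invented for illustration.

```python
from statistics import mean, pstdev


def decoupled_advantages(rewards, eps=1e-6):
    """Normalize each reward dimension separately within one query's group.

    rewards: list of G rollouts, each a list of K reward values.
    Returns a G x K list of per-dimension advantages.
    """
    num_dims = len(rewards[0])
    columns = [[row[k] for row in rewards] for k in range(num_dims)]
    stats = [(mean(col), pstdev(col)) for col in columns]
    return [
        [(row[k] - stats[k][0]) / (stats[k][1] + eps) for k in range(num_dims)]
        for row in rewards
    ]


def agreement(adv_row):
    """Assumed conflict measure: 1.0 if all signs agree, 0.0 if they fully cancel."""
    total_abs = sum(abs(a) for a in adv_row)
    if total_abs == 0.0:
        return 1.0  # no signal at all, so nothing to cancel
    return abs(sum(adv_row)) / total_abs


def conflict_filtered_advantages(rewards, threshold=0.5):
    """Mask high-conflict rollouts and scale the rest by a query-level weight.

    The threshold and the choice of weight (mean agreement) are assumptions.
    Returns (aggregated advantage per rollout, keep mask, query weight).
    """
    adv = decoupled_advantages(rewards)
    scores = [agreement(row) for row in adv]
    mask = [score >= threshold for score in scores]
    query_weight = mean(scores)
    aggregated = [
        query_weight * sum(row) if keep else 0.0
        for row, keep in zip(adv, mask)
    ]
    return aggregated, mask, query_weight


# One query, four rollouts, two reward dimensions.
# Rollouts 0 and 1 are consistent; rollouts 2 and 3 win on one reward and lose on the other.
rewards = [[1.0, 1.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
aggregated, mask, query_weight = conflict_filtered_advantages(rewards)
print(mask, query_weight)
```

In this toy group, the two mixed rollouts have per-dimension advantages that cancel, so they are masked rather than contributing a near-zero aggregate, and the query as a whole is down-weighted because only half of its rollouts show reward consensus.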