DCPO: Dynamic Clipping Policy Optimization

Abstract

Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a promising framework for enhancing the reasoning capabilities of large language models. However, existing approaches such as GRPO often suffer from zero gradients. This problem arises primarily from two sources: fixed clipping bounds on token-level probability ratios, and the standardization of identical rewards. Together these can lead to ineffective gradient updates and underutilization of generated responses. In this work, we propose Dynamic Clipping Policy Optimization (DCPO), which introduces two techniques. First, a dynamic clipping strategy adaptively adjusts the clipping bounds based on token-specific prior probabilities to enhance token-level exploration. Second, a smooth advantage standardization technique standardizes rewards across cumulative training steps to improve response-level utilization of generated responses. DCPO achieves state-of-the-art performance on four benchmarks across four different models. In particular, on the AIME24 benchmark with the Qwen2.5-Math-7B model, DCPO achieves an Avg@1 of 46.7 under greedy decoding and an Avg@32 of 38.8 when sampling 32 times, surpassing both DAPO (36.7/31.6) and GRPO (36.7/32.1). On the AIME25 benchmark with Qwen2.5-14B, DCPO achieves 23.3/19.0, surpassing GRPO (13.3/10.5) and DAPO (20.0/15.3). Furthermore, averaged over the four models, DCPO improves the nonzero advantage by 28% over GRPO, doubles training efficiency relative to DAPO, and reduces the token clipping ratio by an order of magnitude compared to both GRPO and DAPO, while achieving superior performance. These results highlight DCPO's effectiveness in leveraging generated data more efficiently for reinforcement learning in large language models.
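The zero-gradient issue attributed to GRPO-style training can be illustrated with a minimal Python sketch. It covers only the baseline behavior: standardizing a group of identical rewards, and a PPO-style clipped surrogate with fixed bounds. It does not implement DCPO's dynamic clipping or smooth advantage standardization, whose formulas are not given in the abstract. The function names, the stabilizing constant `eps`, the use of the population standard deviation, and the symmetric bound of 0.2 are all illustrative assumptions rather than details taken from the paper.

```python
import statistics


def group_standardized_advantages(rewards, eps=1e-6):
    """Standardize rewards within one sampled group (GRPO-style baseline).

    Assumption: population standard deviation plus a small eps for stability.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_term_has_gradient(ratio, advantage, eps_low=0.2, eps_high=0.2):
    """Check whether min(ratio * A, clip(ratio) * A) still depends on ratio.

    With fixed bounds, a token stops contributing gradient once its
    probability ratio leaves the trusted interval in the direction that
    the advantage favors.
    """
    if advantage > 0:
        return ratio <= 1 + eps_high
    if advantage < 0:
        return ratio >= 1 - eps_low
    return False  # zero advantage: the whole term vanishes


if __name__ == "__main__":
    # All responses receive the same reward, so every advantage is zero
    # and the group contributes no gradient at all.
    print(group_standardized_advantages([1.0, 1.0, 1.0, 1.0]))
    # Mixed rewards yield nonzero advantages of both signs.
    print(group_standardized_advantages([1.0, 0.0, 0.0, 1.0]))
    # A token whose ratio exceeds the fixed upper bound is clipped.
    print(clipped_term_has_gradient(ratio=1.5, advantage=1.0))
```

In this sketch, a group in which every response is correct (or every response is wrong) is discarded in effect, and a token whose ratio drifts past the fixed bound is silenced regardless of how small its prior probability was. These are the two response-level and token-level losses that the abstract says DCPO targets.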
