DCPO: Dynamic Clipping Policy Optimization
September 2, 2025
Authors: Shihui Yang, Chengfeng Dou, Peidong Guo, Kai Lu, Qiang Ju, Fei Deng, Rihui Xin
cs.AI
Abstract
Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a
promising framework for enhancing the reasoning capabilities of large language
models. However, existing approaches such as GRPO often suffer from zero
gradients. This problem arises primarily due to fixed clipping bounds for
token-level probability ratios and the standardization of identical rewards,
which can lead to ineffective gradient updates and underutilization of
generated responses. In this work, we propose Dynamic Clipping Policy
Optimization (DCPO), which introduces a dynamic clipping strategy that
adaptively adjusts the clipping bounds based on token-specific prior
probabilities to enhance token-level exploration, together with a smooth
advantage standardization technique that standardizes rewards across
cumulative training steps to improve the effective utilization of generated
responses at the response level. DCPO achieves state-of-the-art performance
on four benchmarks across four different models. In particular, on the AIME24
benchmark with the Qwen2.5-Math-7B model, DCPO reaches an Avg@1 of 46.7 under
greedy decoding and an Avg@32 of 38.8 averaged over 32 samples, surpassing
both DAPO (36.7/31.6) and GRPO (36.7/32.1). On the AIME25 benchmark with
Qwen2.5-14B, DCPO scores (23.3/19.0), surpassing GRPO (13.3/10.5) and DAPO
(20.0/15.3). Furthermore, DCPO improves the nonzero advantage by an average
of 28% over GRPO across the four models, doubles the training efficiency
relative to DAPO, and reduces the token clipping ratio by an order of
magnitude compared to both GRPO and DAPO, while achieving superior
performance.
These results highlight DCPO's effectiveness in leveraging generated data more
efficiently for reinforcement learning in large language models.
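
To make the two mechanisms above concrete, the following PyTorch sketch shows one possible form of token-wise dynamic clipping bounds and of advantage standardization smoothed over accumulated training steps. The function names (dynamic_clip_bounds, smooth_advantages, dcpo_like_loss), the specific widening rule for the bounds, and the momentum-based running statistics are illustrative assumptions, not the paper's exact formulation.

import torch

def dynamic_clip_bounds(old_probs, base_eps=0.2):
    # Hypothetical token-wise bounds: widen the clipping range for tokens the
    # old policy assigned low probability (more room for exploration), and stay
    # close to the fixed PPO/GRPO bound for high-probability tokens.
    widen = (1.0 - old_probs).clamp(min=0.0)   # in [0, 1)
    eps = base_eps * (1.0 + widen)             # assumed widening rule
    return 1.0 - eps, 1.0 + eps

def smooth_advantages(rewards, running_mean, running_std, momentum=0.99):
    # Hypothetical standardization that mixes current batch statistics with
    # statistics accumulated over earlier training steps, so a batch of
    # identical rewards does not collapse every advantage to exactly zero.
    batch_mean = rewards.mean()
    batch_std = rewards.std(unbiased=False)
    running_mean = momentum * running_mean + (1.0 - momentum) * batch_mean
    running_std = momentum * running_std + (1.0 - momentum) * batch_std
    advantages = (rewards - running_mean) / (running_std + 1e-8)
    return advantages, running_mean, running_std

def dcpo_like_loss(new_logp, old_logp, advantages):
    # Clipped policy-gradient objective using the per-token dynamic bounds.
    ratio = (new_logp - old_logp).exp()
    low, high = dynamic_clip_bounds(old_logp.exp())
    clipped = torch.clamp(ratio, low, high)
    return -torch.min(ratio * advantages, clipped * advantages).mean()

In this toy form, the running statistics are what keep the response-level learning signal alive when all sampled responses receive the same reward, while the widened per-token bounds are what reduce the fraction of tokens whose gradients are clipped away.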