DCPO: Dynamic Clipping Policy Optimization
September 2, 2025
Authors: Shihui Yang, Chengfeng Dou, Peidong Guo, Kai Lu, Qiang Ju, Fei Deng, Rihui Xin
cs.AI
Abstract
Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a
promising framework for enhancing the reasoning capabilities of large language
models. However, existing approaches such as GRPO often suffer from zero
gradients. This problem arises primarily due to fixed clipping bounds for
token-level probability ratios and the standardization of identical rewards,
which can lead to ineffective gradient updates and underutilization of
generated responses. In this work, we propose Dynamic Clipping Policy
Optimization (DCPO), which introduces a dynamic clipping strategy that
adaptively adjusts the clipping bounds based on token-specific prior
probabilities to enhance token-level exploration, together with a smooth
advantage standardization technique that standardizes rewards across
cumulative training steps to improve the effective utilization of generated
responses at the response level. DCPO achieves state-of-the-art performance
on four benchmarks across four different models. In particular, on the AIME24
benchmark with the Qwen2.5-Math-7B model, DCPO reaches an Avg@1 of 46.7 under
greedy decoding and an Avg@32 of 38.8 averaged over 32 samples, surpassing
both DAPO (36.7/31.6) and GRPO (36.7/32.1). On the AIME25 benchmark with
Qwen2.5-14B, DCPO scores (23.3/19.0), surpassing GRPO (13.3/10.5) and DAPO
(20.0/15.3). Furthermore, DCPO improves the nonzero advantage by an average
of 28% over GRPO across the four models, doubles the training efficiency
relative to DAPO, and reduces the token clipping ratio by an order of
magnitude compared to both GRPO and DAPO, while achieving superior
performance.
These results highlight DCPO's effectiveness in leveraging generated data more
efficiently for reinforcement learning in large language models.
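
To make the two mechanisms above concrete, the following PyTorch sketch shows one possible form of token-wise dynamic clipping bounds and of advantage standardization smoothed over accumulated training steps. The function names (dynamic_clip_bounds, smooth_advantages, dcpo_like_loss), the specific widening rule for the bounds, and the momentum-based running statistics are illustrative assumptions, not the paper's exact formulation.

import torch

def dynamic_clip_bounds(old_probs, base_eps=0.2):
    # Hypothetical token-wise bounds: widen the clipping range for tokens the
    # old policy assigned low probability (more room for exploration), and stay
    # close to the fixed PPO/GRPO bound for high-probability tokens.
    widen = (1.0 - old_probs).clamp(min=0.0)   # in [0, 1)
    eps = base_eps * (1.0 + widen)             # assumed widening rule
    return 1.0 - eps, 1.0 + eps

def smooth_advantages(rewards, running_mean, running_std, momentum=0.99):
    # Hypothetical standardization that mixes current batch statistics with
    # statistics accumulated over earlier training steps, so a batch of
    # identical rewards does not collapse every advantage to exactly zero.
    batch_mean = rewards.mean()
    batch_std = rewards.std(unbiased=False)
    running_mean = momentum * running_mean + (1.0 - momentum) * batch_mean
    running_std = momentum * running_std + (1.0 - momentum) * batch_std
    advantages = (rewards - running_mean) / (running_std + 1e-8)
    return advantages, running_mean, running_std

def dcpo_like_loss(new_logp, old_logp, advantages):
    # Clipped policy-gradient objective using the per-token dynamic bounds.
    ratio = (new_logp - old_logp).exp()
    low, high = dynamic_clip_bounds(old_logp.exp())
    clipped = torch.clamp(ratio, low, high)
    return -torch.min(ratio * advantages, clipped * advantages).mean()

In this toy form, the running statistics are what keep the response-level learning signal alive when all sampled responses receive the same reward, while the widened per-token bounds are what reduce the fraction of tokens whose gradients are clipped away.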