DCPO: Dynamic Clipping Policy Optimization
September 2, 2025
Authors: Shihui Yang, Chengfeng Dou, Peidong Guo, Kai Lu, Qiang Ju, Fei Deng, Rihui Xin
cs.AI
Abstract
Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a
promising framework for enhancing the reasoning capabilities of large language
models. However, existing approaches such as GRPO often suffer from zero
gradients. This problem arises primarily due to fixed clipping bounds for
token-level probability ratios and the standardization of identical rewards,
which can lead to ineffective gradient updates and underutilization of
generated responses. In this work, we propose Dynamic Clipping Policy
Optimization (DCPO), which introduces a dynamic clipping strategy that
adaptively adjusts the clipping bounds based on token-specific prior
probabilities to enhance token-level exploration, and a smooth advantage
standardization technique that standardizes rewards across cumulative training
steps to improve the effective utilization of generated responses at the
response level. DCPO achieves state-of-the-art performance on four benchmarks
based on four different models. In particular, DCPO achieves an Avg@1 of 46.7
under greedy decoding and an Avg@32 of 38.8 with 32-sample evaluation on the
AIME24 benchmark, surpassing both DAPO (36.7/31.6) and GRPO (36.7/32.1) on the
Qwen2.5-Math-7B model. On the AIME25 benchmark with Qwen2.5-14B, DCPO reaches
23.3/19.0, surpassing GRPO (13.3/10.5) and DAPO (20.0/15.3). Furthermore, DCPO
yields an average 28% improvement in the nonzero advantage over GRPO across the
four models, doubles training efficiency relative to DAPO, and reduces the token
clipping ratio by an order of magnitude compared to both GRPO and DAPO, while
maintaining superior performance.
These results highlight DCPO's effectiveness in leveraging generated data more
efficiently for reinforcement learning in large language models.
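As a concrete illustration of the two mechanisms summarized in the abstract, the sketch below shows how a probability-dependent clipping bound and a cumulative ("smooth") advantage standardization could plug into a GRPO-style clipped objective. Everything here is an assumption made for illustration: the bound schedule in dynamic_clip_bounds, the exponential-moving-average statistics in SmoothAdvantageNormalizer, and all constants are hypothetical and are not taken from the paper.

```python
import torch


def dynamic_clip_bounds(prior_prob, eps=0.2):
    # Hypothetical schedule: widen the clipping range for low-probability
    # tokens so they can still receive meaningful updates, and keep it
    # tighter for high-probability tokens. The exact bound function used by
    # DCPO is not given in the abstract; this is only an illustrative choice.
    scale = (1.0 - prior_prob).clamp(min=0.0, max=1.0)
    eps_t = eps * (1.0 + scale)          # per-token clipping half-width
    return 1.0 - eps_t, 1.0 + eps_t


class SmoothAdvantageNormalizer:
    """Standardizes rewards with statistics accumulated over training steps
    (an assumed exponential-moving-average scheme) rather than per-batch
    statistics alone, so a group of identical rewards does not collapse to
    an exactly zero advantage."""

    def __init__(self, momentum=0.99, eps=1e-8):
        self.momentum = momentum
        self.eps = eps
        self.mean = 0.0
        self.var = 1.0

    def __call__(self, rewards):
        batch_mean = rewards.mean().item()
        batch_var = rewards.var(unbiased=False).item()
        self.mean = self.momentum * self.mean + (1 - self.momentum) * batch_mean
        self.var = self.momentum * self.var + (1 - self.momentum) * batch_var
        return (rewards - self.mean) / (self.var ** 0.5 + self.eps)


def dcpo_style_loss(logp_new, logp_old, prior_prob, advantages):
    # GRPO-style clipped surrogate in which the fixed (1 - eps, 1 + eps)
    # bounds are replaced by per-token dynamic bounds.
    ratio = torch.exp(logp_new - logp_old)
    low, high = dynamic_clip_bounds(prior_prob)
    clipped = torch.maximum(torch.minimum(ratio, high), low)
    per_token = torch.minimum(ratio * advantages, clipped * advantages)
    return -per_token.mean()
```

In this sketch, prior_prob would come from the old policy's per-token probabilities, and the normalizer would be applied to group rewards before the resulting advantages enter dcpo_style_loss; the intent is only to show how widening the bound for rare tokens and pooling reward statistics across steps can both reduce the chance of all-zero gradient updates.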