DCPO: Dynamic Clipping Policy Optimization

Abstract

Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a promising framework for enhancing the reasoning capabilities of large language models. However, existing approaches such as GRPO often suffer from zero gradients. The problem arises primarily from fixed clipping bounds on token-level probability ratios and from the standardization of identical rewards, both of which can lead to ineffective gradient updates and underutilization of generated responses. In this work, we propose Dynamic Clipping Policy Optimization (DCPO), which introduces two techniques: a dynamic clipping strategy that adaptively adjusts the clipping bounds based on token-specific prior probabilities to enhance token-level exploration, and a smooth advantage standardization technique that standardizes rewards across cumulative training steps to improve response-level utilization of generated responses. DCPO achieves state-of-the-art performance on four benchmarks across four different models. In particular, on the AIME24 benchmark with the Qwen2.5-Math-7B model, DCPO achieves an Avg@1 of 46.7 under greedy decoding and an Avg@32 of 38.8 when sampling 32 times, surpassing both DAPO (36.7/31.6) and GRPO (36.7/32.1). On the AIME25 benchmark with Qwen2.5-14B, DCPO achieves 23.3/19.0 (Avg@1/Avg@32), surpassing GRPO (13.3/10.5) and DAPO (20.0/15.3). Furthermore, DCPO achieves an average 28% improvement in nonzero advantage over GRPO across the four models, doubles training efficiency relative to DAPO, and reduces the token clipping ratio by an order of magnitude compared to both GRPO and DAPO, while achieving superior performance. These results highlight DCPO's effectiveness in leveraging generated data more efficiently for reinforcement learning in large language models.
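
The zero-gradient issue attributed to GRPO above can be illustrated with a minimal Python sketch. It is not DCPO itself, whose formulas the abstract does not state. It assumes the commonly described setup: advantages are obtained by standardizing rewards within a group of responses, and the surrogate objective uses a PPO-style fixed clipping range. The function names and the value `clip_eps=0.2` are hypothetical choices made for illustration.

```python
from statistics import mean, pstdev


def group_standardized_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses (GRPO-style, assumed)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO-style objective with a fixed clipping range (clip_eps is an assumption)."""
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# Mixed rewards give informative, nonzero advantages.
mixed = group_standardized_advantages([1.0, 0.0, 0.0, 1.0])

# Identical rewards collapse to all-zero advantages, so the whole group
# of responses contributes no gradient signal.
identical = group_standardized_advantages([1.0, 1.0, 1.0, 1.0])

# With a positive advantage, the objective is flat once the ratio exceeds
# 1 + clip_eps, so such tokens also receive zero gradient.
flat_a = clipped_surrogate(1.5, 1.0)
flat_b = clipped_surrogate(1.6, 1.0)

print(mixed, identical, flat_a, flat_b)
```

The two failure modes in this sketch correspond to the two causes named in the abstract: standardization of identical rewards and fixed clipping bounds on probability ratios.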
