ΔL Normalization: Rethink Loss Aggregation in RLVR

Abstract

We propose ΔL Normalization, a simple yet effective loss aggregation method tailored to the dynamic generation lengths that characterize Reinforcement Learning with Verifiable Rewards (RLVR). RLVR has recently shown strong potential for improving the reasoning capabilities of large language models (LLMs), but response lengths vary widely during training, which leads to high gradient variance and unstable optimization. Previous methods such as GRPO, DAPO, and Dr. GRPO introduce different loss normalization terms to address this issue, yet they either produce biased estimates or still suffer from high gradient variance. By analyzing the effect of varying lengths on the policy loss both theoretically and empirically, we reformulate the problem as finding a minimum-variance unbiased estimator. The proposed ΔL Normalization not only provides an unbiased estimate of the true policy loss but also minimizes gradient variance in theory. Extensive experiments show that it consistently achieves superior results across different model sizes, maximum lengths, and tasks. Our code will be made public at https://github.com/zerolllin/Delta-L-Normalization.
