
ΔL Normalization: Rethink Loss Aggregation in RLVR

September 9, 2025
Authors: Zhiyuan He, Xufang Luo, Yike Zhang, Yuqing Yang, Lili Qiu
cs.AI

Abstract

We propose Delta L Normalization, a simple yet effective loss aggregation method tailored to the dynamic generation lengths characteristic of Reinforcement Learning with Verifiable Rewards (RLVR). Recently, RLVR has demonstrated strong potential for improving the reasoning capabilities of large language models (LLMs), but a major challenge lies in the large variability of response lengths during training, which leads to high gradient variance and unstable optimization. Although previous methods such as GRPO, DAPO, and Dr. GRPO introduce different loss normalization terms to address this issue, they either produce biased estimates or still suffer from high gradient variance. By analyzing the effect of varying lengths on the policy loss both theoretically and empirically, we reformulate the problem as finding a minimum-variance unbiased estimator. Our proposed Delta L Normalization not only provides an unbiased estimate of the true policy loss but also theoretically minimizes gradient variance. Extensive experiments show that it consistently achieves superior results across different model sizes, maximum lengths, and tasks. Our code will be made public at https://github.com/zerolllin/Delta-L-Normalization.
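
To make the contrast between aggregation schemes concrete, here is a minimal sketch (assuming PyTorch; the function names are hypothetical) comparing GRPO-style per-response averaging, DAPO-style global token averaging, and a textbook inverse-variance-weighted combination of per-response means. The last one only illustrates the minimum-variance unbiased idea the abstract describes; the exact ΔL weights are defined in the paper, not here.

```python
# Sketch, not the authors' implementation: contrasting loss aggregation schemes
# when response lengths vary widely. Per-token values stand in for the per-token
# policy-gradient loss terms produced during RLVR training.
import torch


def grpo_style(token_losses):
    """Average tokens within each response, then average across responses."""
    return torch.stack([t.mean() for t in token_losses]).mean()


def token_level(token_losses):
    """Pool every token in the batch and take one global mean (DAPO-style)."""
    return torch.cat(token_losses).mean()


def min_variance_combine(token_losses):
    """Weight each response's mean loss by the inverse of its estimated variance
    (sample variance / length), then normalize the weights. Under the usual
    independence assumptions this is the standard minimum-variance unbiased
    combination of the per-response estimates; it is an illustration only,
    not the paper's Delta L formula."""
    means = torch.stack([t.mean() for t in token_losses])
    var_of_mean = torch.stack([t.var(unbiased=True) / len(t) for t in token_losses])
    weights = 1.0 / var_of_mean
    weights = weights / weights.sum()
    return (weights * means).sum()


if __name__ == "__main__":
    torch.manual_seed(0)
    # Responses of very different lengths, as happens during RLVR training.
    token_losses = [torch.randn(32), torch.randn(512), torch.randn(2048)]
    for name, fn in [("per-response mean", grpo_style),
                     ("global token mean", token_level),
                     ("inverse-variance", min_variance_combine)]:
        print(f"{name:>18}: {fn(token_losses).item():.4f}")
```

Running the sketch on responses of lengths 32, 512, and 2048 shows how the three schemes weight short and long responses differently, which is the source of the bias/variance trade-off the abstract refers to.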