

ΔL Normalization: Rethink Loss Aggregation in RLVR

September 9, 2025
Authors: Zhiyuan He, Xufang Luo, Yike Zhang, Yuqing Yang, Lili Qiu
cs.AI

Abstract

We propose Delta L Normalization, a simple yet effective loss aggregation method tailored to the dynamic generation lengths characteristic of Reinforcement Learning with Verifiable Rewards (RLVR). Recently, RLVR has demonstrated strong potential in improving the reasoning capabilities of large language models (LLMs), but a major challenge lies in the large variability of response lengths during training, which leads to high gradient variance and unstable optimization. Although previous methods such as GRPO, DAPO, and Dr. GRPO introduce different loss normalization terms to address this issue, they either produce biased estimates or still suffer from high gradient variance. By analyzing the effect of varying lengths on the policy loss both theoretically and empirically, we reformulate the problem as finding a minimum-variance unbiased estimator. Our proposed Delta L Normalization not only provides an unbiased estimate of the true policy loss but also minimizes gradient variance in theory. Extensive experiments show that it consistently achieves superior results across different model sizes, maximum lengths, and tasks. Our code will be made public at https://github.com/zerolllin/Delta-L-Normalization.
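To make the minimum-variance framing above concrete, here is a minimal sketch, in PyTorch, of a few ways to aggregate per-token policy losses over responses of different lengths. The function `aggregate_policy_loss`, the scheme names, and the variance model `1 / L_i**alpha` are assumptions introduced for illustration only; this is not the paper's implementation of ΔL Normalization, GRPO, DAPO, or Dr. GRPO.

```python
import torch


def per_sequence_means(token_losses):
    """Mean token loss per response; each is an estimate of the policy loss."""
    return torch.stack([t.mean() for t in token_losses])


def aggregate_policy_loss(token_losses, scheme="per_sequence_mean", alpha=1.0):
    """Illustrative ways to aggregate per-token policy losses over responses of
    varying length L_i. Simplified sketches, not the exact published objectives.

    token_losses: list of 1-D tensors, one per sampled response, length L_i each.
    """
    lengths = torch.tensor([float(t.numel()) for t in token_losses])

    if scheme == "per_sequence_mean":
        # Average within each response, then across responses; every response
        # contributes equally regardless of its length.
        return per_sequence_means(token_losses).mean()

    if scheme == "global_token_mean":
        # Pool all tokens and divide by the total token count; the divisor
        # changes with the sampled lengths from batch to batch.
        return torch.cat(token_losses).sum() / lengths.sum()

    if scheme == "inverse_variance_weighted":
        # Generic minimum-variance unbiased combination: if each per-sequence
        # mean is an unbiased estimate of the true loss with variance var_i,
        # the lowest-variance unbiased linear combination weights it by
        # 1 / var_i, normalized to sum to one. Here var_i = 1 / L_i**alpha is
        # a hypothetical stand-in for whatever length dependence one assumes.
        var = 1.0 / lengths ** alpha
        weights = (1.0 / var) / (1.0 / var).sum()
        return (weights * per_sequence_means(token_losses)).sum()

    raise ValueError(f"unknown scheme: {scheme}")
```

With `alpha=1` the weighted combination reduces exactly to the global token mean; other assumed variance models yield different weightings, which is the kind of trade-off the abstract's minimum-variance, unbiased-estimator framing concerns.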