ΔL Normalization: Rethink Loss Aggregation in RLVR

Abstract

We propose Delta L Normalization, a simple yet effective loss aggregation method tailored to the dynamic generation lengths that characterize Reinforcement Learning with Verifiable Rewards (RLVR). RLVR has recently shown strong potential for improving the reasoning capabilities of large language models (LLMs), but response lengths vary widely during training, which leads to high gradient variance and unstable optimization. Previous methods such as GRPO, DAPO, and Dr. GRPO introduce different loss normalization terms to address this issue, but they either produce biased estimates or still suffer from high gradient variance. By analyzing the effect of varying lengths on the policy loss both theoretically and empirically, we reformulate the problem as finding a minimum-variance unbiased estimator. The proposed Delta L Normalization not only provides an unbiased estimate of the true policy loss but also minimizes gradient variance in theory. Extensive experiments show that it consistently achieves superior results across different model sizes, maximum lengths, and tasks. Our code will be made public at https://github.com/zerolllin/Delta-L-Normalization.
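To make the phrase "minimum-variance unbiased estimator" concrete, the following is a minimal sketch of generic inverse-variance weighting. It is not the Delta L Normalization formula, which the abstract does not state. The sketch assumes that each response yields an independent, unbiased gradient estimate whose variance is proportional to its length; that assumption, the example lengths, and all function names are illustrative and do not come from the abstract.

```python
def normalize(weights):
    """Rescale weights to sum to 1, which keeps the combined estimate unbiased."""
    total = sum(weights)
    return [w / total for w in weights]


def combined_variance(weights, variances):
    """Variance of sum_i w_i * g_i for independent estimators g_i."""
    return sum(w * w * v for w, v in zip(weights, variances))


def inverse_variance_weights(variances):
    """Weights proportional to 1 / variance, normalized to sum to 1."""
    return normalize([1.0 / v for v in variances])


# Hypothetical response lengths (illustrative only).
lengths = [100, 400, 1600]

# Assumption: per-response gradient variance is proportional to length,
# with the proportionality constant set to 1 for simplicity.
variances = [float(length) for length in lengths]

uniform = normalize([1.0] * len(lengths))
inv_var = inverse_variance_weights(variances)

var_uniform = combined_variance(uniform, variances)
var_inv = combined_variance(inv_var, variances)

print("uniform weights variance:         ", var_uniform)
print("inverse-variance weights variance:", var_inv)
```

Under these assumptions both weightings are unbiased because the weights sum to 1, but weighting each response by the inverse of its length gives a lower combined variance than uniform averaging, since long responses contribute the noisiest estimates.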
