DLER: Doing Length pEnalty Right - Incentivizing More Intelligence per Token via Reinforcement Learning

Abstract

Reasoning language models such as OpenAI-o1, DeepSeek-R1, and Qwen achieve strong performance via extended chains of thought but often generate unnecessarily long outputs. Maximizing intelligence per token, that is, accuracy relative to response length, remains an open problem. We revisit reinforcement learning (RL) with the simplest length penalty, truncation, and show that accuracy degradation arises not from the lack of sophisticated penalties but from inadequate RL optimization. We identify three key challenges: (i) large bias in advantage estimation, (ii) entropy collapse, and (iii) sparse reward signals. We address them with Doing Length pEnalty Right (DLER), a training recipe combining batch-wise reward normalization, higher clipping, dynamic sampling, and a simple truncation length penalty. DLER achieves state-of-the-art accuracy-efficiency trade-offs, cutting output length by over 70 percent while surpassing the accuracy of all previous baselines. It also improves test-time scaling: compared to DeepSeek-R1-7B, DLER-7B generates multiple concise responses in parallel with 28 percent higher accuracy and lower latency. We further introduce Difficulty-Aware DLER, which adaptively tightens truncation on easier questions for additional efficiency gains. We also propose an update-selective merging method that preserves baseline accuracy while retaining the concise reasoning ability of the DLER model, which is useful when RL training data is scarce.
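
The abstract names the ingredients of the recipe but gives no formulas, so the following is only a minimal sketch of how three of them might fit together, under stated assumptions. The function names, the 0/1 reward, the token budget, and the exact normalization (per-prompt mean subtraction followed by division by a batch-wide standard deviation) are illustrative guesses, not the authors' implementation.

```python
from statistics import mean, pstdev


def truncation_reward(is_correct: bool, num_tokens: int, max_tokens: int) -> float:
    """Assumed truncation penalty: reward 1.0 only if the answer is correct
    AND the response fits within the token budget; otherwise 0.0."""
    return 1.0 if (is_correct and num_tokens <= max_tokens) else 0.0


def keep_prompt(group_rewards: list[float]) -> bool:
    """Assumed dynamic-sampling filter: drop prompts whose rollouts all
    received the same reward, since they carry no learning signal."""
    return len(set(group_rewards)) > 1


def batchwise_advantages(groups: list[list[float]], eps: float = 1e-6) -> list[list[float]]:
    """Assumed batch-wise normalization: center each prompt's rewards on that
    prompt's mean, then scale by the standard deviation over the whole batch."""
    centered = [[r - mean(g) for r in g] for g in groups]
    flat = [a for g in centered for a in g]
    scale = pstdev(flat) + eps
    return [[a / scale for a in g] for g in centered]


# Hypothetical rollouts: (is_correct, num_tokens) per sampled response.
budget = 4000
rollouts = {
    "easy": [(True, 900), (True, 1200), (True, 800)],
    "hard": [(True, 3000), (True, 5200), (False, 2500)],
}

rewards = {
    name: [truncation_reward(ok, n, budget) for ok, n in samples]
    for name, samples in rollouts.items()
}
kept = [r for r in rewards.values() if keep_prompt(r)]
advantages = batchwise_advantages(kept)
print(rewards)
print(advantages)
```

In this toy batch the "easy" prompt is filtered out because every rollout earns the same reward, while the "hard" prompt is kept; its over-budget response is scored like a wrong one even though its answer was correct. Higher clipping, the remaining ingredient, acts on the policy-ratio bound in the RL loss and is not shown here.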
