DLER: Doing Length pEnalty Right - Incentivizing More Intelligence per Token via Reinforcement Learning
October 16, 2025
Authors: Shih-Yang Liu, Xin Dong, Ximing Lu, Shizhe Diao, Mingjie Liu, Min-Hung Chen, Hongxu Yin, Yu-Chiang Frank Wang, Kwang-Ting Cheng, Yejin Choi, Jan Kautz, Pavlo Molchanov
cs.AI
Abstract
Reasoning language models such as OpenAI-o1, DeepSeek-R1, and Qwen achieve strong performance via extended chains of thought but often generate unnecessarily long outputs. Maximizing intelligence per token, i.e., accuracy relative to response length, remains an open problem. We revisit reinforcement learning (RL) with the simplest length penalty, truncation, and show that accuracy degradation arises not from the lack of sophisticated penalties but from inadequate RL optimization. We identify three key challenges: (i) large bias in advantage estimation, (ii) entropy collapse, and (iii) sparse reward signals. We address them with Doing Length pEnalty Right (DLER), a training recipe combining batch-wise reward normalization, higher clipping, dynamic sampling, and a simple truncation length penalty. DLER achieves state-of-the-art accuracy-efficiency trade-offs, cutting output length by over 70 percent while surpassing the accuracy of all previous baselines. It also improves test-time scaling: compared to DeepSeek-R1-7B, DLER-7B generates multiple concise responses in parallel with 28 percent higher accuracy and lower latency. We further introduce Difficulty-Aware DLER, which adaptively tightens truncation on easier questions for additional efficiency gains. We also propose an update-selective merging method that preserves baseline accuracy while retaining the concise reasoning ability of the DLER model, which is useful for scenarios where RL training data is scarce.
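The abstract does not spell out the recipe in code, but the sketch below illustrates one plausible reading of two of its ingredients: the truncation length penalty (a response that hits the length budget earns no reward) and batch-wise reward normalization (advantages standardized over the whole batch rather than per-prompt groups). The function name dler_advantages and the exact shaping rule are assumptions made here for illustration, not the paper's implementation.

    import numpy as np

    def dler_advantages(rewards, lengths, max_len, eps=1e-6):
        """Hypothetical sketch: truncation penalty + batch-wise reward normalization.

        rewards: per-response correctness rewards (e.g., 0/1)
        lengths: response lengths in tokens
        max_len: generation budget used for the truncation penalty
        """
        rewards = np.asarray(rewards, dtype=np.float64)
        lengths = np.asarray(lengths)

        # Truncation length penalty (assumed form): a response cut off at the
        # budget receives zero reward, regardless of correctness.
        shaped = np.where(lengths >= max_len, 0.0, rewards)

        # Batch-wise reward normalization: standardize over the whole batch
        # instead of per-prompt groups, which the abstract credits with
        # reducing bias in advantage estimation.
        return (shaped - shaped.mean()) / (shaped.std() + eps)

    # Example: four sampled responses, one truncated at a 4096-token budget.
    print(dler_advantages(rewards=[1, 1, 0, 1], lengths=[812, 4096, 655, 733], max_len=4096))

The other two ingredients named in the abstract, higher clipping and dynamic sampling, would live in the policy-update and rollout-selection steps of the RL loop and are omitted from this sketch.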