DLER: Doing Length pEnalty Right - Incentivizing More Intelligence per Token via Reinforcement Learning
October 16, 2025
作者: Shih-Yang Liu, Xin Dong, Ximing Lu, Shizhe Diao, Mingjie Liu, Min-Hung Chen, Hongxu Yin, Yu-Chiang Frank Wang, Kwang-Ting Cheng, Yejin Choi, Jan Kautz, Pavlo Molchanov
cs.AI
Abstract
Reasoning language models such as OpenAI-o1, DeepSeek-R1, and Qwen achieve
strong performance via extended chains of thought but often generate
unnecessarily long outputs. Maximizing intelligence per token--accuracy
relative to response length--remains an open problem. We revisit reinforcement
learning (RL) with the simplest length penalty--truncation--and show that
accuracy degradation arises not from the lack of sophisticated penalties but
from inadequate RL optimization. We identify three key challenges: (i) large
bias in advantage estimation, (ii) entropy collapse, and (iii) sparse reward
signal. We address them with Doing Length pEnalty Right (DLER), a training
recipe combining batch-wise reward normalization, higher clipping, dynamic
sampling, and a simple truncation length penalty. DLER achieves
state-of-the-art accuracy--efficiency trade-offs, cutting output length by over
70 percent while surpassing the accuracy of all previous baselines. It also improves
test-time scaling: compared to DeepSeek-R1-7B, DLER-7B generates multiple
concise responses in parallel with 28 percent higher accuracy and lower
latency. We further introduce Difficulty-Aware DLER, which adaptively tightens
truncation on easier questions for additional efficiency gains. We also propose
an update-selective merging method that preserves baseline accuracy while
retaining the concise reasoning ability of the DLER model, which is useful for
scenarios where RL training data is scarce.
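
To make the recipe more concrete, below is a minimal, illustrative Python sketch of two ingredients named in the abstract: the simple truncation length penalty (an over-length response is cut off and scored as incorrect) and batch-wise reward normalization for advantage estimation. The function names, the binary correctness reward, and the token budget are assumptions made for illustration only, not the authors' implementation.

```python
import numpy as np

def truncation_reward(is_correct: bool, resp_len: int, max_len: int) -> float:
    """Truncation length penalty (sketch): a response exceeding the length
    budget is truncated and receives zero reward; otherwise a binary
    correctness reward is used (assumed here for simplicity)."""
    if resp_len >= max_len:
        return 0.0
    return 1.0 if is_correct else 0.0

def batchwise_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Batch-wise reward normalization (sketch): advantages are computed
    against the mean/std of all rewards in the batch rather than within
    small per-prompt groups, reducing the estimation bias that appears
    when the truncation penalty zeroes out many rewards in a group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Toy usage: 8 sampled responses with hypothetical lengths and correctness.
lens    = np.array([900, 4100, 1200, 3800, 700, 4500, 1500, 2000])
correct = np.array([1,   1,    0,    1,    1,   1,    0,    1])
MAX_LEN = 4000  # hypothetical token budget

rewards = np.array([truncation_reward(bool(c), int(l), MAX_LEN)
                    for c, l in zip(correct, lens)])
advantages = batchwise_advantages(rewards)
print(rewards, advantages.round(3))
```

This sketch omits the other two components of the recipe (the higher clipping threshold and dynamic sampling), which act on the policy-update and rollout-selection stages of the RL loop rather than on the reward itself.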