

DLER: Doing Length pEnalty Right - Incentivizing More Intelligence per Token via Reinforcement Learning

October 16, 2025
Authors: Shih-Yang Liu, Xin Dong, Ximing Lu, Shizhe Diao, Mingjie Liu, Min-Hung Chen, Hongxu Yin, Yu-Chiang Frank Wang, Kwang-Ting Cheng, Yejin Choi, Jan Kautz, Pavlo Molchanov
cs.AI

Abstract

Reasoning language models such as OpenAI-o1, DeepSeek-R1, and Qwen achieve strong performance via extended chains of thought but often generate unnecessarily long outputs. Maximizing intelligence per token--accuracy relative to response length--remains an open problem. We revisit reinforcement learning (RL) with the simplest length penalty--truncation--and show that accuracy degradation arises not from the lack of sophisticated penalties but from inadequate RL optimization. We identify three key challenges: (i) large bias in advantage estimation, (ii) entropy collapse, and (iii) sparse reward signal. We address them with Doing Length pEnalty Right (DLER), a training recipe combining batch-wise reward normalization, higher clipping, dynamic sampling, and a simple truncation length penalty. DLER achieves state-of-the-art accuracy--efficiency trade-offs, cutting output length by over 70 percent while surpassing the accuracy of all previous baselines. It also improves test-time scaling: compared to DeepSeek-R1-7B, DLER-7B generates multiple concise responses in parallel with 28 percent higher accuracy and lower latency. We further introduce Difficulty-Aware DLER, which adaptively tightens truncation on easier questions for additional efficiency gains. We also propose an update-selective merging method that preserves baseline accuracy while retaining the concise reasoning ability of the DLER model, which is useful for scenarios where RL training data is scarce.
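
To make two of the recipe's ingredients concrete, below is a minimal, illustrative Python sketch of a truncation length penalty combined with batch-wise reward normalization, plus a dynamic-sampling filter that drops prompts yielding no learning signal. It assumes binary correctness rewards; the function names (`dler_advantages`, `keep_prompt`) and all details are hypothetical and are not the authors' implementation.

```python
import numpy as np

def dler_advantages(correct, lengths, max_len, eps=1e-6):
    """Illustrative sketch of reward shaping and advantage estimation.

    correct : 0/1 correctness scores, one per sampled response in the batch
    lengths : response lengths in tokens
    max_len : truncation budget; responses exceeding it receive zero reward
    """
    correct = np.asarray(correct, dtype=float)
    lengths = np.asarray(lengths)

    # Truncation length penalty: a response earns reward only if it is
    # correct and finishes within the truncation budget.
    rewards = np.where(lengths <= max_len, correct, 0.0)

    # Batch-wise reward normalization: standardize over the whole batch
    # (rather than small per-prompt groups) to reduce bias in advantages.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def keep_prompt(group_rewards):
    """Dynamic sampling (sketch): keep a prompt only if its sampled
    responses are neither all correct nor all incorrect, so it still
    contributes a non-zero advantage signal."""
    m = float(np.mean(group_rewards))
    return 0.0 < m < 1.0
```

In practice these pieces would sit inside a GRPO/PPO-style training loop alongside the higher clipping threshold mentioned above; the sketch only isolates the reward-shaping and sampling logic.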