

DLER: Doing Length pEnalty Right - Incentivizing More Intelligence per Token via Reinforcement Learning

October 16, 2025
作者: Shih-Yang Liu, Xin Dong, Ximing Lu, Shizhe Diao, Mingjie Liu, Min-Hung Chen, Hongxu Yin, Yu-Chiang Frank Wang, Kwang-Ting Cheng, Yejin Choi, Jan Kautz, Pavlo Molchanov
cs.AI

Abstract

Reasoning language models such as OpenAI-o1, DeepSeek-R1, and Qwen achieve strong performance via extended chains of thought but often generate unnecessarily long outputs. Maximizing intelligence per token--accuracy relative to response length--remains an open problem. We revisit reinforcement learning (RL) with the simplest length penalty--truncation--and show that accuracy degradation arises not from the lack of sophisticated penalties but from inadequate RL optimization. We identify three key challenges: (i) large bias in advantage estimation, (ii) entropy collapse, and (iii) sparse reward signal. We address them with Doing Length pEnalty Right (DLER), a training recipe combining batch-wise reward normalization, higher clipping, dynamic sampling, and a simple truncation length penalty. DLER achieves state-of-the-art accuracy--efficiency trade-offs, cutting output length by over 70 percent while surpassing all previous baseline accuracy. It also improves test-time scaling: compared to DeepSeek-R1-7B, DLER-7B generates multiple concise responses in parallel with 28 percent higher accuracy and lower latency. We further introduce Difficulty-Aware DLER, which adaptively tightens truncation on easier questions for additional efficiency gains. We also propose an update-selective merging method that preserves baseline accuracy while retaining the concise reasoning ability of the DLER model, which is useful for scenarios where RL training data is scarce.
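Below is a minimal, illustrative sketch of two ingredients the abstract names: a hard truncation length penalty and batch-wise reward normalization for advantage estimation. It is not the authors' released implementation; the function name, the reward scheme (correct-and-within-budget gets reward 1, otherwise 0), and the length budget are assumptions for illustration only.

```python
import torch

def dler_style_advantages(correct: torch.Tensor, lengths: torch.Tensor, max_len: int = 1024) -> torch.Tensor:
    """Hypothetical sketch of truncation-penalized, batch-normalized advantages.

    correct: bool tensor [batch], whether each sampled response is correct.
    lengths: int tensor [batch], token length of each sampled response.
    """
    # Truncation length penalty: any response exceeding the budget receives
    # zero reward, regardless of correctness (assumed 0/1 reward scheme).
    rewards = torch.where(lengths <= max_len,
                          correct.float(),
                          torch.zeros_like(lengths, dtype=torch.float))
    # Batch-wise reward normalization: center and scale over the whole batch
    # to reduce bias in the advantage estimates used by the RL update.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

# Example: four sampled responses; the last one is correct but over budget.
correct = torch.tensor([True, True, False, True])
lengths = torch.tensor([600, 900, 400, 1500])
print(dler_style_advantages(correct, lengths))
```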