

On the Optimal Reasoning Length for RL-Trained Language Models

February 10, 2026
Authors: Daisuke Nohara, Taishi Nakamura, Rio Yokota
cs.AI

Abstract

Reinforcement learning substantially improves reasoning in large language models, but it also tends to lengthen chain-of-thought outputs and increase computational cost during both training and inference. Although length control methods have been proposed, it remains unclear what output length best balances efficiency and performance. In this work, we compare several length control methods on two models, Qwen3-1.7B Base and DeepSeek-R1-Distill-Qwen-1.5B. Our results indicate that length penalties may hinder reasoning acquisition, while properly tuned length control can improve efficiency for models with strong prior reasoning ability. By extending prior work to RL-trained policies, we identify two failure modes: 1) long outputs increase dispersion, and 2) short outputs lead to under-thinking.
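To make the efficiency-versus-performance trade-off concrete, below is a minimal sketch of one common form of length penalty applied to the RL reward. This is not necessarily the formulation studied in the paper; the penalty shape, the coefficient `alpha`, and the cap `max_len` are illustrative assumptions.

```python
def length_penalized_reward(task_reward: float,
                            num_output_tokens: int,
                            alpha: float = 1e-4,
                            max_len: int = 4096) -> float:
    """Subtract a penalty proportional to output length from the task reward.

    task_reward: correctness-based reward (e.g., 1.0 if the final answer is right).
    num_output_tokens: length of the generated chain of thought plus answer.
    alpha: penalty strength (assumed value). Too large a penalty can suppress
           the long reasoning traces needed to acquire reasoning skills;
           too small a penalty leaves outputs long and costly.
    max_len: assumed hard cap; tokens beyond it are not penalized further.
    """
    effective_len = min(num_output_tokens, max_len)
    return task_reward - alpha * effective_len


# Example: a correct answer produced with 1,200 output tokens.
print(length_penalized_reward(1.0, 1200))  # 1.0 - 1e-4 * 1200 = 0.88
```

In this toy form, the two failure modes in the abstract correspond to the extremes of `alpha`: a near-zero penalty permits very long, high-dispersion outputs, while an overly large penalty pushes the policy toward short, under-thought responses.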