On the Optimal Reasoning Length for RL-Trained Language Models
February 10, 2026
Authors: Daisuke Nohara, Taishi Nakamura, Rio Yokota
cs.AI
Abstract
Reinforcement learning substantially improves the reasoning ability of large language models, but it also tends to lengthen chain-of-thought outputs, increasing computational cost during both training and inference. Although length control methods have been proposed, the output length that best balances efficiency and performance remains unclear. In this work, we compare several length control methods on two models, Qwen3-1.7B Base and DeepSeek-R1-Distill-Qwen-1.5B. Our results indicate that length penalties may hinder the acquisition of reasoning ability, while properly tuned length control can improve efficiency for models that already possess strong reasoning ability. By extending prior work to RL-trained policies, we identify two failure modes: (1) overly long outputs increase answer dispersion, and (2) overly short outputs lead to under-thinking.
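To make the notion of a length penalty concrete, the sketch below shows one plausible way a penalty on output length can be folded into the reward used during RL training. It is a minimal illustration only: the linear excess-over-budget form, the 1024-token target, and the coefficient alpha are assumptions for exposition, not the specific length control methods compared in this work.

    # Illustrative sketch: correctness reward combined with a length penalty.
    # The penalty form and hyperparameters below are assumptions, not the
    # paper's method.

    def shaped_reward(is_correct: bool,
                      num_output_tokens: int,
                      target_len: int = 1024,
                      alpha: float = 1e-4) -> float:
        """Correctness reward minus a penalty for exceeding a length budget."""
        correctness = 1.0 if is_correct else 0.0
        # Penalize only tokens beyond the target budget, and cap the penalty
        # so it never dominates the correctness signal.
        excess = max(0, num_output_tokens - target_len)
        penalty = min(alpha * excess, 0.5)
        return correctness - penalty

    if __name__ == "__main__":
        print(shaped_reward(True, 800))    # 1.0   (correct, within budget)
        print(shaped_reward(True, 4096))   # ~0.69 (correct, penalized for excess length)
        print(shaped_reward(False, 300))   # 0.0   (incorrect)

A shaping term like this is what the abstract refers to as a length penalty: if its weight is too large early in training, the policy is pushed toward short outputs before it has acquired reasoning ability; if outputs are left unconstrained, they can grow long and increase answer dispersion.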