Train Long, Think Short: Curriculum Learning for Efficient Reasoning

Abstract

Recent work on enhancing the reasoning abilities of large language models (LLMs) has introduced explicit length control as a means of constraining computational cost while preserving accuracy. However, existing approaches rely on fixed-length training budgets, which do not take advantage of the natural progression from exploration to compression during learning. In this work, we propose a curriculum learning strategy for length-controlled reasoning using Group Relative Policy Optimization (GRPO). Our method starts with generous token budgets and gradually tightens them over training, encouraging models to first discover effective solution strategies and then distill them into more concise reasoning traces. We augment GRPO with a reward function that balances three signals: task correctness (via verifier feedback), length efficiency, and formatting adherence (via structural tags). Experiments on GSM8K, MATH500, SVAMP, College Math, and GSM+ demonstrate that curriculum-based training consistently outperforms fixed-budget baselines at the same final budget, achieving higher accuracy and significantly improved token efficiency. We further ablate the impact of reward weighting and decay schedule design, showing that progressive constraint serves as a powerful inductive bias for training efficient reasoning models. Our code and checkpoints are released at: https://github.com/hammoudhasan/curriculum_grpo.
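The two mechanisms named in the abstract, a token budget that tightens over training and a reward that mixes correctness, length efficiency, and format adherence, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper or its released code: the exponential step-wise decay, the budget values, the linear length penalty, the `<think>`/`<answer>` tag names, the reward weights, and all function names.

```python
import re

# Assumed tag layout; the abstract only says "structural tags".
FORMAT_RE = re.compile(r"^<think>.*</think>\s*<answer>.*</answer>$", re.DOTALL)


def budget_at_step(step, start_budget=512, final_budget=128, decay=0.5, interval=100):
    """Hypothetical curriculum: multiply the budget by `decay` every
    `interval` training steps, never dropping below `final_budget`."""
    shrunk = start_budget * decay ** (step // interval)
    return max(final_budget, int(shrunk))


def length_reward(num_tokens, budget):
    """Assumed shape: full reward within budget, then a linear falloff
    that reaches zero at twice the budget."""
    if num_tokens <= budget:
        return 1.0
    return max(0.0, 1.0 - (num_tokens - budget) / budget)


def total_reward(completion, num_tokens, is_correct, budget,
                 w_correct=1.0, w_length=0.5, w_format=0.25):
    """Weighted sum of the three signals; the weights are illustrative."""
    format_ok = 1.0 if FORMAT_RE.match(completion.strip()) else 0.0
    return (w_correct * float(is_correct)
            + w_length * length_reward(num_tokens, budget)
            + w_format * format_ok)


if __name__ == "__main__":
    sample = "<think>2 + 2 = 4</think><answer>4</answer>"
    for step in (0, 100, 250, 1000):
        budget = budget_at_step(step)
        print(step, budget, total_reward(sample, 200, True, budget))
```

In this sketch the same 200-token correct answer earns the full length reward early in training and a reduced one once the budget has tightened, which is the pressure toward shorter traces that the abstract describes. In a GRPO setup, a scalar like `total_reward` would be computed per sampled completion and then normalized within each group of samples.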

