Train Long, Think Short: Curriculum Learning for Efficient Reasoning

Abstract

Recent work on enhancing the reasoning abilities of large language models (LLMs) has introduced explicit length control as a means of constraining computational cost while preserving accuracy. However, existing approaches rely on fixed-length training budgets, which do not take advantage of the natural progression from exploration to compression during learning. In this work, we propose a curriculum learning strategy for length-controlled reasoning using Group Relative Policy Optimization (GRPO). Our method starts with generous token budgets and gradually tightens them over training, encouraging models to first discover effective solution strategies and then distill them into more concise reasoning traces. We augment GRPO with a reward function that balances three signals: task correctness (via verifier feedback), length efficiency, and formatting adherence (via structural tags). Experiments on GSM8K, MATH500, SVAMP, College Math, and GSM+ demonstrate that curriculum-based training consistently outperforms fixed-budget baselines at the same final budget, achieving higher accuracy and significantly improved token efficiency. We further ablate the impact of reward weighting and decay schedule design, showing that progressive constraint serves as a powerful inductive bias for training efficient reasoning models. Our code and checkpoints are released at: https://github.com/hammoudhasan/curriculum_grpo.
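
The two ingredients described above, a token budget that tightens over training and a reward that combines correctness, length efficiency, and formatting, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper or its repository: the stepwise exponential decay, the budget values, the reward weights, the linear over-budget penalty, and the `<think>`/`<answer>` tag names.

```python
import re

# Hypothetical structural tags; the abstract only says "structural tags".
TAG_PATTERN = re.compile(r"^<think>.*</think>\s*<answer>.*</answer>$", re.DOTALL)


def token_budget(step, initial_budget=512, final_budget=128, decay=0.8, interval=50):
    """Illustrative curriculum: start generous, shrink every `interval` steps,
    and never go below the final budget. All defaults are assumed values."""
    budget = initial_budget * decay ** (step // interval)
    return max(final_budget, int(budget))


def combined_reward(completion, num_tokens, is_correct, budget,
                    w_correct=1.0, w_length=0.5, w_format=0.2):
    """Weighted sum of three signals. The weights are assumptions."""
    # Task correctness, as judged by an external verifier.
    correctness = 1.0 if is_correct else 0.0
    # Length efficiency: full credit within budget, linear penalty beyond it.
    if num_tokens <= budget:
        length = 1.0
    else:
        length = max(0.0, 1.0 - (num_tokens - budget) / budget)
    # Formatting adherence: the completion must follow the tag structure.
    formatting = 1.0 if TAG_PATTERN.match(completion.strip()) else 0.0
    return w_correct * correctness + w_length * length + w_format * formatting


if __name__ == "__main__":
    sample = "<think>3 + 4 = 7</think><answer>7</answer>"
    for step in (0, 50, 200, 1000):
        b = token_budget(step)
        print(step, b, combined_reward(sample, num_tokens=150, is_correct=True, budget=b))
```

In a GRPO setup, a reward like this would be computed for each sampled completion in a group, using the budget in effect at the current training step, so the same completion length earns less reward as training progresses.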
