A Practical Two-Stage Recipe for Mathematical LLMs: Maximizing Accuracy with SFT and Efficiency with Reinforcement Learning
July 11, 2025
Authors: Hiroshi Yoshihara, Taiki Yamaguchi, Yuichi Inoue
cs.AI
Abstract
Enhancing the mathematical reasoning of Large Language Models (LLMs) is a
pivotal challenge in advancing AI capabilities. While Supervised Fine-Tuning
(SFT) and Reinforcement Learning (RL) are the dominant training paradigms, a
systematic methodology for combining them to maximize both accuracy and
efficiency remains largely unexplored. This paper introduces a practical and
effective training recipe that strategically integrates extended SFT with
online reinforcement learning via Group Relative Policy Optimization (GRPO). We
posit that these methods play complementary,
not competing, roles: a prolonged SFT phase first pushes the model's accuracy
to its limits, after which a GRPO phase dramatically improves token efficiency
while preserving this peak performance. Our experiments reveal that extending
SFT for as many as 10 epochs is crucial for performance breakthroughs, and that
the primary role of GRPO in this framework is to optimize solution length. The
efficacy of our recipe is rigorously validated through top-tier performance on
challenging benchmarks, including a high rank among over 2,200 teams in the
strictly leak-free AI Mathematical Olympiad (AIMO). This work provides the
community with a battle-tested blueprint for developing state-of-the-art
mathematical reasoners that are both exceptionally accurate and practically
efficient. To ensure full reproducibility and empower future research, we will
open-source our entire framework, including all code, model checkpoints, and
training configurations at
https://github.com/analokmaus/kaggle-aimo2-fast-math-r1.
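
For readers who want a concrete picture of the two-stage recipe, the sketch below shows one way the pipeline could be wired together with the Hugging Face TRL library. It is a minimal sketch under stated assumptions, not the authors' implementation: the base model name, dataset files, column names, and the correctness-plus-length reward are illustrative placeholders, and the exact code, checkpoints, and training configurations are in the repository linked above.

```python
# Illustrative sketch of the two-stage recipe (extended SFT, then GRPO) using
# the Hugging Face TRL library. Model names, file names, dataset columns, and
# the reward weighting are assumptions, not the authors' released configuration.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer, GRPOConfig, GRPOTrainer

MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # assumed base model

# --- Stage 1: extended SFT (the paper reports up to 10 epochs) ---
sft_data = load_dataset("json", data_files="sft_math_traces.jsonl", split="train")
sft_trainer = SFTTrainer(
    model=MODEL,
    train_dataset=sft_data,  # expects a "text" column with full worked solutions
    args=SFTConfig(output_dir="stage1-sft", num_train_epochs=10),
)
sft_trainer.train()
sft_trainer.save_model("stage1-sft")  # checkpoint that stage 2 starts from

# --- Stage 2: GRPO to shorten solutions while preserving peak accuracy ---
def reward_correct_and_short(completions, answer, **kwargs):
    """Reward correct final answers and mildly penalize length (illustrative)."""
    rewards = []
    for completion, gold in zip(completions, answer):
        correct = 1.0 if str(gold) in completion else 0.0
        length_penalty = 0.001 * len(completion)  # character-level proxy for tokens
        rewards.append(correct - length_penalty)
    return rewards

rl_data = load_dataset("json", data_files="rl_math_problems.jsonl", split="train")
grpo_trainer = GRPOTrainer(
    model="stage1-sft",                      # continue from the SFT checkpoint
    reward_funcs=reward_correct_and_short,
    train_dataset=rl_data,                   # expects "prompt" and "answer" columns
    args=GRPOConfig(
        output_dir="stage2-grpo",
        num_generations=8,                   # group size for relative advantages
        max_completion_length=4096,
    ),
)
grpo_trainer.train()
```

The design point the sketch mirrors is the paper's central claim: stage 1 runs SFT well past the usual one-to-three epochs to push accuracy to its ceiling, and stage 2's reward only needs to preserve correctness while discouraging unnecessarily long solutions, so GRPO's effect is mainly on token efficiency rather than raw accuracy.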