

A Practical Two-Stage Recipe for Mathematical LLMs: Maximizing Accuracy with SFT and Efficiency with Reinforcement Learning

July 11, 2025
Authors: Hiroshi Yoshihara, Taiki Yamaguchi, Yuichi Inoue
cs.AI

Abstract

Enhancing the mathematical reasoning of Large Language Models (LLMs) is a pivotal challenge in advancing AI capabilities. While Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) are the dominant training paradigms, a systematic methodology for combining them to maximize both accuracy and efficiency remains largely unexplored. This paper introduces a practical and effective training recipe that strategically integrates extended SFT with RL from online inference (GRPO). We posit that these methods play complementary, not competing, roles: a prolonged SFT phase first pushes the model's accuracy to its limits, after which a GRPO phase dramatically improves token efficiency while preserving this peak performance. Our experiments reveal that extending SFT for as many as 10 epochs is crucial for performance breakthroughs, and that the primary role of GRPO in this framework is to optimize solution length. The efficacy of our recipe is rigorously validated through top-tier performance on challenging benchmarks, including a high rank among over 2,200 teams in the strictly leak-free AI Mathematical Olympiad (AIMO). This work provides the community with a battle-tested blueprint for developing state-of-the-art mathematical reasoners that are both exceptionally accurate and practically efficient. To ensure full reproducibility and empower future research, we will open-source our entire framework, including all code, model checkpoints, and training configurations at https://github.com/analokmaus/kaggle-aimo2-fast-math-r1.