A Practical Two-Stage Recipe for Mathematical LLMs: Maximizing Accuracy with SFT and Efficiency with Reinforcement Learning

Abstract

Enhancing the mathematical reasoning of Large Language Models (LLMs) is a pivotal challenge in advancing AI capabilities. While Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) are the dominant training paradigms, a systematic methodology for combining them to maximize both accuracy and efficiency remains largely unexplored. This paper introduces a practical and effective training recipe that strategically integrates extended SFT with RL from online inference (GRPO). We posit that these methods play complementary, not competing, roles: a prolonged SFT phase first pushes the model's accuracy to its limits, after which a GRPO phase dramatically improves token efficiency while preserving this peak performance. Our experiments reveal that extending SFT for as many as 10 epochs is crucial for performance breakthroughs, and that the primary role of GRPO in this framework is to optimize solution length. The efficacy of our recipe is validated through top-tier performance on challenging benchmarks, including a high rank among over 2,200 teams in the strictly leak-free AI Mathematical Olympiad (AIMO). This work provides the community with a battle-tested blueprint for developing state-of-the-art mathematical reasoners that are both exceptionally accurate and practically efficient. To ensure full reproducibility and empower future research, we will open-source our entire framework, including all code, model checkpoints, and training configurations, at https://github.com/analokmaus/kaggle-aimo2-fast-math-r1.
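To illustrate how a GRPO phase can favor shorter solutions without giving up correctness, the following is a minimal sketch, not the authors' implementation. It shows the group-relative advantage that characterizes GRPO: each sampled solution's reward is standardized against the other samples for the same problem. The reward function `length_aware_reward`, its `length_weight` parameter, and the example token counts are hypothetical and chosen only for illustration; the abstract does not specify the reward actually used.

```python
from statistics import mean, pstdev


def length_aware_reward(is_correct, num_tokens, max_tokens, length_weight=0.2):
    """Hypothetical reward (an assumption, not taken from the paper):
    a correct answer earns 1.0 minus a small penalty that grows with length;
    an incorrect answer earns 0.0 regardless of length."""
    if not is_correct:
        return 0.0
    used_fraction = min(num_tokens, max_tokens) / max_tokens
    return 1.0 - length_weight * used_fraction


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward against its own sampling group.
    Using the population standard deviation is a choice made for this sketch."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled solutions to one problem: (is_correct, num_tokens).
samples = [(True, 2000), (True, 8000), (False, 4000), (True, 4000)]
rewards = [length_aware_reward(ok, n, max_tokens=8000) for ok, n in samples]
advantages = group_relative_advantages(rewards)

for (ok, n), adv in zip(samples, advantages):
    print(f"correct={ok!s:5} tokens={n:5d} advantage={adv:+.3f}")
```

Under these assumptions, the shortest correct solution receives the largest advantage and the incorrect one receives a negative advantage, so the policy update pushes toward answers that are both correct and concise.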
