Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling

Abstract

Recent progress in reasoning models has substantially advanced long-horizon mathematical and scientific problem solving, and several systems now reach gold-medal-level performance on International Mathematical Olympiad (IMO) and International Physics Olympiad (IPhO) problems. In this paper, we introduce a simple and unified recipe for converting a post-trained reasoning backbone into a rigorous olympiad-level solver. The recipe has three parts. First, SFT with a reverse-perplexity curriculum instills rigorous proof-search and self-checking behaviors. Second, a two-stage RL pipeline scales these behaviors, progressing from RL with verifiable rewards to more delicate proof-level RL. Third, test-time scaling further boosts solving performance. Applying this recipe, we train a 30B-A3B backbone with SFT on around 340K trajectories of fewer than 8K tokens each, followed by 200 RL steps. The resulting model, SU-01, reasons stably on difficult problems with trajectories exceeding 100K tokens and achieves gold-medal-level performance on mathematics and physics olympiad competitions, including IMO 2025/USAMO 2026 and IPhO 2024/2025. It also demonstrates strong generalization of scientific reasoning to domains beyond mathematics and physics.
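
The abstract does not specify how the reverse-perplexity curriculum is constructed. As a rough sketch only, assuming that "reverse" means presenting SFT trajectories from highest to lowest perplexity under the backbone model, the ordering step might look like the following. The function names, the sample ids, and the per-token log-probabilities are all hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence, Tuple


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity of one trajectory: exp of the negative mean token log-probability."""
    if not token_logprobs:
        raise ValueError("trajectory has no tokens")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def reverse_perplexity_order(
    samples: List[Tuple[str, Sequence[float]]]
) -> List[str]:
    """Return sample ids sorted from highest to lowest perplexity.

    ASSUMPTION: the high-to-low direction is a guess at what
    "reverse-perplexity" means; the source does not define it.
    """
    scored = [(perplexity(logprobs), sample_id) for sample_id, logprobs in samples]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sample_id for _, sample_id in scored]


# Hypothetical per-token log-probabilities assigned by the backbone model.
samples = [
    ("easy", [-0.1, -0.2, -0.1]),
    ("hard", [-2.0, -1.5, -2.5]),
    ("medium", [-0.8, -1.0, -0.6]),
]
order = reverse_perplexity_order(samples)
print(order)
```

In practice the log-probabilities would come from a forward pass of the backbone over each trajectory; that step is omitted here because the abstract gives no detail about it.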