Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling

Abstract

Recent progress in reasoning models has substantially advanced long-horizon mathematical and scientific problem solving, and several systems now reach gold-medal-level performance on International Mathematical Olympiad (IMO) and International Physics Olympiad (IPhO) problems. In this paper, we introduce a simple and unified recipe for converting a post-trained reasoning backbone into a rigorous olympiad-level solver. The recipe has three parts. First, supervised fine-tuning (SFT) with a reverse-perplexity curriculum instills rigorous proof-search and self-checking behaviors. Second, a two-stage reinforcement learning (RL) pipeline scales these behaviors, progressing from RL with verifiable rewards to finer-grained proof-level RL. Third, test-time scaling further boosts solving performance. Applying this recipe, we train a 30B-A3B backbone with SFT on roughly 340K trajectories of under 8K tokens each, followed by 200 RL steps. The resulting model, SU-01, reasons stably on difficult problems with trajectories exceeding 100K tokens and achieves gold-medal-level performance on mathematics and physics olympiad competitions, including IMO 2025/USAMO 2026 and IPhO 2024/2025. Its scientific reasoning also generalizes strongly to domains beyond mathematics and physics.