Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling

Abstract

Recent progress in reasoning models has substantially advanced long-horizon mathematical and scientific problem solving, with several systems now reaching gold-medal-level performance on International Mathematical Olympiad (IMO) and International Physics Olympiad (IPhO) problems. In this paper, we introduce a simple and unified recipe for converting a post-trained reasoning backbone into a rigorous olympiad-level solver. The recipe has three steps. First, SFT with a reverse-perplexity curriculum instills rigorous proof-search and self-checking behaviors. Second, a two-stage RL pipeline scales these behaviors, progressing from RL with verifiable rewards to finer-grained proof-level RL. Third, test-time scaling further boosts solving performance. Applying this recipe, we train a 30B-A3B backbone with SFT on around 340K sub-8K-token trajectories followed by 200 RL steps. The resulting model, SU-01, supports stable reasoning on difficult problems with trajectories exceeding 100K tokens and achieves gold-medal-level performance on mathematics and physics olympiad competitions, including IMO 2025/USAMO 2026 and IPhO 2024/2025. It also shows strong generalization of scientific reasoning to domains beyond mathematics and physics.
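As an illustration of the SFT step, the sketch below shows one plausible reading of a reverse-perplexity curriculum: score each trajectory by its perplexity under the backbone and present the highest-perplexity trajectories first. The abstract does not define the ordering, so the direction of the sort, the `Trajectory` container, and the `order_by_reverse_perplexity` helper are all assumptions made for illustration and should not be read as the authors' implementation.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    # Hypothetical container: the trajectory text plus the per-token
    # natural-log probabilities assigned to it by the backbone model.
    text: str
    token_logprobs: List[float]


def perplexity(traj: Trajectory) -> float:
    """Perplexity = exp(-mean token log-probability)."""
    if not traj.token_logprobs:
        raise ValueError("trajectory has no tokens")
    mean_logprob = sum(traj.token_logprobs) / len(traj.token_logprobs)
    return math.exp(-mean_logprob)


def order_by_reverse_perplexity(trajs: List[Trajectory]) -> List[Trajectory]:
    # Assumption: "reverse" means descending perplexity, i.e. the
    # trajectories the backbone finds least likely are seen first.
    return sorted(trajs, key=perplexity, reverse=True)
```

If the intended curriculum runs in the opposite direction, only the `reverse=True` flag would change; the perplexity scoring itself stays the same.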