MaxProof: Scaling Mathematical Proof via Generation-Verification RL and Population-Level Test-Time Scaling

Abstract

We present MaxProof, a population-level test-time scaling framework for competition-level mathematical proof in the MiniMax-M3 series. M3 is first trained on three proof-oriented capabilities: proof generation, proof verification, and critique-conditioned proof repair. Training relies on a defense-in-depth generative verifier engineered for a low false-positive rate. These capabilities are then merged into a single released M3 model. At test time, MaxProof uses that one model as generator, verifier, refiner, and ranker: it searches over a population of candidate proofs and returns a single final proof chosen by tournament selection. With MaxProof test-time scaling, the M3 model reaches 35/42 on IMO 2025 and 36/42 on USAMO 2026, exceeding the human gold-medal threshold on both.
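The test-time procedure described above can be illustrated with a minimal sketch. It is not the MaxProof implementation: the function names (`population_search`, `generate`, `verify`, `refine`, `compare`), the population size, the number of repair rounds, and the single-elimination bracket are all assumptions made for illustration, and the four roles are passed in as plain callables standing in for calls to the same model. The toy stubs at the end exist only so the sketch runs without a model.

```python
from typing import Callable, List, Tuple

Proof = str


def population_search(
    problem: str,
    generate: Callable[[str], Proof],
    verify: Callable[[str, Proof], Tuple[bool, str]],
    refine: Callable[[str, Proof, str], Proof],
    compare: Callable[[str, Proof, Proof], Proof],
    population_size: int = 8,  # hypothetical default, not from the paper
    rounds: int = 2,           # hypothetical default, not from the paper
) -> Proof:
    """Search a population of candidate proofs and return one final proof."""
    if population_size < 1:
        raise ValueError("population_size must be at least 1")

    # Generator role: sample an initial population of candidate proofs.
    population: List[Proof] = [generate(problem) for _ in range(population_size)]

    # Verifier and refiner roles: repair rejected candidates using the critique.
    for _ in range(rounds):
        next_population: List[Proof] = []
        for proof in population:
            accepted, critique = verify(problem, proof)
            next_population.append(
                proof if accepted else refine(problem, proof, critique)
            )
        population = next_population

    # Prefer candidates the verifier accepts; fall back to all if none pass.
    verified = [p for p in population if verify(problem, p)[0]]
    contenders = verified or population

    # Ranker role: single-elimination tournament over pairwise comparisons
    # (an assumed bracket format; the source only says "tournament selection").
    while len(contenders) > 1:
        winners: List[Proof] = []
        for i in range(0, len(contenders) - 1, 2):
            winners.append(compare(problem, contenders[i], contenders[i + 1]))
        if len(contenders) % 2 == 1:
            winners.append(contenders[-1])  # odd one out gets a bye
        contenders = winners
    return contenders[0]


# Toy stand-ins for the model roles (purely illustrative).
def toy_generate(problem: str) -> Proof:
    return "draft"


def toy_verify(problem: str, proof: Proof) -> Tuple[bool, str]:
    if proof.endswith("QED"):
        return True, ""
    return False, "missing conclusion"


def toy_refine(problem: str, proof: Proof, critique: str) -> Proof:
    return proof + " QED"


def toy_compare(problem: str, a: Proof, b: Proof) -> Proof:
    return a if len(a) <= len(b) else b
```

In this sketch the verifier acts twice, once to produce critiques for repair and once to filter the final contenders, so a verifier with a low false-positive rate directly limits how many flawed proofs reach the tournament.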