MaxProof: Scaling Mathematical Proof with Generative Verifier Reinforcement Learning and Population-Level Test-Time Scaling

Abstract

We present MaxProof, a population-level test-time scaling framework for competition-level mathematical proof in the MiniMax-M3 series. M3 is first trained on three proof-oriented capabilities (proof generation, proof verification, and critique-conditioned proof repair) using a defense-in-depth generative verifier engineered for a low false-positive rate. These capabilities are then merged into a single released M3 model. At test time, MaxProof uses the model as a generator, verifier, refiner, and ranker: it searches over a population of candidate proofs and returns one final proof through tournament selection. With MaxProof test-time scaling, the M3 model scores 35/42 on IMO 2025 and 36/42 on USAMO 2026, exceeding the human gold-medal threshold on both.
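The test-time procedure can be pictured with a minimal sketch. The abstract specifies only the four roles (generator, verifier, refiner, ranker), a population of candidate proofs, and tournament selection. Everything else below is an assumption made for illustration: the function names `population_search` and `tournament`, the parameters `pop_size`, `rounds`, and `keep`, the keep-the-top-scoring survival rule, and the single-elimination bracket. The four roles are passed in as plain callables standing in for calls to the model, and the demo replaces a proof with a toy numeric quality score.

```python
import random
from typing import Callable, List, TypeVar

P = TypeVar("P")  # a candidate proof (hypothetical representation)


def tournament(candidates: List[P], prefer: Callable[[P, P], P]) -> P:
    """Single-elimination bracket; `prefer` plays the ranker role (assumed form)."""
    if not candidates:
        raise ValueError("tournament needs at least one candidate")
    pool = list(candidates)
    while len(pool) > 1:
        winners = []
        # Compare candidates pairwise and keep the preferred one from each pair.
        for i in range(0, len(pool) - 1, 2):
            winners.append(prefer(pool[i], pool[i + 1]))
        if len(pool) % 2 == 1:
            winners.append(pool[-1])  # odd candidate out gets a bye
        pool = winners
    return pool[0]


def population_search(
    generate: Callable[[], P],      # generator role
    verify: Callable[[P], float],   # verifier role: higher score = more trusted
    refine: Callable[[P], P],       # refiner role: repair a candidate
    prefer: Callable[[P, P], P],    # ranker role: pick the better of two
    pop_size: int = 8,              # assumed hyperparameters, not from the source
    rounds: int = 3,
    keep: int = 4,
) -> P:
    """Illustrative population loop: generate, verify, refine, then rank."""
    population = [generate() for _ in range(pop_size)]
    for _ in range(rounds):
        # Keep the candidates the verifier trusts most (assumed survival rule).
        survivors = sorted(population, key=verify, reverse=True)[:keep]
        # Next population: the survivors plus a refined copy of each.
        population = survivors + [refine(p) for p in survivors]
    return tournament(population, prefer)


if __name__ == "__main__":
    # Toy stand-ins: a "proof" is just a quality score in [0, 1].
    rng = random.Random(0)
    best = population_search(
        generate=lambda: rng.random(),
        verify=lambda p: p,
        refine=lambda p: min(1.0, p + 0.1),
        prefer=max,
    )
    print(best)
```

In this sketch the verifier score is used only to prune the population, while the final choice goes through pairwise comparisons. That split is one plausible reading of "searches over a population ... and returns one final proof through tournament selection", not a description of the actual MaxProof implementation.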