Goedel-Prover-V2: 足場データ合成と自己修正による形式的定理証明のスケーリング

要旨

Goedel-Prover-V2を紹介する。これは、自動定理証明において新たな最先端を確立する一連のオープンソース言語モデルである。標準的な専門家反復と強化学習のパイプラインに基づいて構築された本アプローチは、以下の3つの主要な革新を組み込んでいる：(1) 足場付きデータ合成：モデルがより複雑な定理を習得できるよう、難易度を段階的に上げた合成タスクを生成する。(2) 検証器による自己修正：Leanコンパイラからのフィードバックを活用し、モデルが証明を反復的に修正できるようにする。(3) モデル平均化：トレーニングの後期段階でモデル出力の多様性が低下するのを防ぐため、モデルのチェックポイントを統合する。我々の小型モデル、Goedel-Prover-V2-8Bは、MiniF2Fにおいて84.6%のpass@32を達成し、同じ指標でDeepSeek-Prover-V2-671Bを上回り、そのサイズは80分の1である。我々の主力モデル、Goedel-Prover-V2-32Bは、MiniF2Fにおいて標準モードで88.1%、自己修正モードで90.4%のpass@32を達成し、従来のSOTAを大幅に上回る。さらに、主力モデルはPutnamBenchにおいて184回の試行で86問を解決し、オープンソースモデルのリーダーボードで首位を獲得し、DeepSeek-Prover-V2-671Bが1024回の試行で47問を解決した記録を、大幅に小さいモデルサイズと計算予算で上回った。リリース時（2025年7月～8月）において、Goedel-Prover-V2は全てのオープンソース定理証明器の中で最も強力な総合性能を達成した。また、公開された性能が報告されているクローズドソースシステムを含むトップパフォーマンスモデルの中でも、制約付きテストタイム計算予算下で上位に位置する。我々のモデル、コード、データはhttps://github.com/Goedel-LM/Goedel-Prover-V2で公開されている。

English

We introduce Goedel-Prover-V2, a series of open-source language models that set a new state-of-the-art in automated theorem proving. Built on the standard expert iteration and reinforcement learning pipeline, our approach incorporates three key innovations: (1) Scaffolded data synthesis: We generate synthetic tasks of increasing difficulty to train the model to master increasingly complex theorems; (2) Verifier-guided self-correction: We enable the model to iteratively revise its proofs by leveraging feedback from the Lean compiler; (3) Model averaging: We merge model checkpoints to mitigate the decrease in model output diversity in later stages of training. Our small model, Goedel-Prover-V2-8B, reaches 84.6% pass@32 on MiniF2F and outperforms DeepSeek-Prover-V2-671B under the same metric, despite being 80X smaller. Our flagship model, Goedel-Prover-V2-32B, achieves 88.1% on MiniF2F at pass@32 in standard mode and 90.4% in self-correction mode, outperforming prior SOTA by a large margin. Additionally, our flagship model solves 86 problems on PutnamBench at pass@184, securing the first place among open-source models on the leaderboard, surpassing DeepSeek-Prover-V2-671B's record of solving 47 problems by pass@1024 with a significantly smaller model size and compute budget. At the time of its release (July-August 2025), Goedel-Prover-V2 achieves the strongest overall performance among all open-source theorem provers. It also ranks among the top-performing models--including closed-source systems with publicly reported performance--under a constrained test-time compute budget. Our models, code, and data are released at https://github.com/Goedel-LM/Goedel-Prover-V2.

Goedel-Prover-V2: 足場データ合成と自己修正による形式的定理証明のスケーリング

Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction

要旨

Support