Pythagoras-Prover: 拡張Lean形式化による効率的な形式的証明の促進

要旨

最新のLean定理証明系は、検証済み証明データの不足や形式証明探索における長い推論トレースに一部起因して、大規模な学習および推論計算を伴わなければ強力な性能を達成できず、その結果、教師ありファインチューニング（SFT）とサンプリングの両方が高コストになっている。本稿では、実用的な計算予算向けに設計された、計算効率の高いオープンソースのLean定理証明系ファミリーであるPythagoras-Proverを紹介する。このファミリーは、4Bおよび32Bパラメータの自己回帰モデルと、推論時にLean証明を反復的に洗練する初の概念実証としての拡散型証明器（4B）の2つの生成パラダイムにわたる。学習効率の向上のために、カリキュラムSFT用に易・中・難の問題に層別されたLean検証済みコーパスを構築し、モデルが短く簡単な証明から長く難しい証明へと段階的に証明スキルを習得できるようにする。SFT中は、動的証明推論フィルタリング方式により、各インスタンスを8kトークンのコンテキスト予算内に保ちつつ、情報豊富な証明トレースを保持する。また、Augmented Lean Formalisation（ALF）を導入する。これは、希少な検証済みコーパスを形式的な命題のバリエーションに拡張し、変異させた各インスタンスを形式的に検証することなく、自己蒸留によって追加の学習信号を充填するものである。既知の問題をその形式的特性を保ちつつ摂動することで、ALFは任意の命題の表面的な形式への依存を低減する。実験的に、Pythagoras-Prover-4BはMiniF2F-Testにおいてpass@32でDeepSeek-Prover-V2-671Bを上回り（86.1%対82.4%）、パラメータ数は約167分の1である。一方、Pythagoras-Prover-32BはMiniF2F-Testで93.0%のオープンソース最高性能を達成し、PutnamBenchの672問題中93問を解く。我々はMiniF2F-ALFを公開する。これはALFによって変異された汚染に敏感なベンチマークであり、評価されたすべてのモデルが精度を低下させる。このベンチマークにおいて、我々の32Bモデルは最強であり、4Bモデルは従来の最高性能であるGoedel-Prover-V2-32Bに匹敵する。

English

Modern Lean theorem provers achieve strong performance only with substantial training and inference compute, driven in part by scarce verified proof data and the long reasoning traces of formal proof search, making both supervised fine-tuning (SFT) and sampling expensive. We introduce Pythagoras-Prover, a compute-efficient open-source family of Lean theorem provers built for practical compute budgets. The family spans two generation paradigms: autoregressive models at 4B and 32B parameters, and a first proof-of-concept diffusion-based prover (4B) that iteratively refines Lean proofs at inference time. For training efficiency, we build a Lean-verified corpus stratified into easy, medium, and hard problems for curriculum SFT, so models acquire proof skills progressively from shorter, simpler proofs to longer, harder ones. During SFT, a dynamic proof-reasoning filtering scheme preserves informative proof traces while keeping each instance within an 8k-token context budget. We also introduce Augmented Lean Formalisation (ALF), which expands scarce verified corpora into variants of formal statements, populated via self-distillation for extra training signal without formally verifying every mutated instance. By perturbing known problems while preserving their formal character, ALF reduces reliance on any statement's surface form. Empirically, Pythagoras-Prover-4B surpasses DeepSeek-Prover-V2-671B at pass@32 on MiniF2F-Test (86.1% vs 82.4%) with ~167x fewer parameters, while Pythagoras-Prover-32B sets the open-source state of the art at 93.0% on MiniF2F-Test and solves 93 of 672 PutnamBench problems. We release MiniF2F-ALF, an ALF-mutated contamination-sensitive benchmark on which every evaluated model loses accuracy; here our 32B remains strongest and our 4B matches the prior state of the art, Goedel-Prover-V2-32B.