Leanabell-Prover: 形式推論におけるポストトレーニングスケーリング

要旨

最近のLLM（大規模言語モデル）を用いた自動定理証明（ATP）の進展により、Lean 4コードを用いた形式的推論の可能性が浮き彫りになっています。しかし、ATPはOpenAIのO1/O3やDeepseekのR1で示されたような最近のポストトレーニングスケーリングによる革命をまだ経験していません。本研究では、ATPのポストトレーニング全体を調査し、自然言語における推論モデルのブレークスルーと整合させることを目指します。まず、現在のATPモデルを、多数のステートメント-証明ペアと、人間の推論や仮説精緻化を模倣する認知行動を取り入れるための追加データからなるハイブリッドデータセットで継続的にトレーニングします。次に、Lean 4コンパイラによって返される結果報酬を用いた強化学習を探求します。私たちが設計した継続的トレーニングと強化学習のプロセスを通じて、DeepSeek-Prover-v1.5やGoedel-Proverを含む既存の形式的証明器を改善し、全証明生成の分野で最先端の性能を達成しました。例えば、MiniF2Fにおいて59.8%のパス率（pass@32）を達成しています。これは進行中のプロジェクトであり、私たちは発見を段階的に更新し、データとトレーニングの詳細を公開していく予定です。

English

Recent advances in automated theorem proving (ATP) through LLMs have highlighted the potential of formal reasoning with Lean 4 codes. However, ATP has not yet be revolutionized by the recent posttraining scaling as demonstrated by Open AI O1/O3 and Deepseek R1. In this work, we investigate the entire posttraining of ATP, aiming to align it with breakthroughs in reasoning models in natural languages.To begin, we continual train current ATP models with a hybrid dataset, which consists of numerous statement-proof pairs, and additional data aimed at incorporating cognitive behaviors that emulate human reasoning and hypothesis refinement. Next, we explore reinforcement learning with the use of outcome reward returned by Lean 4 compiler. Through our designed continual training and reinforcement learning processes, we have successfully improved existing formal provers, including both DeepSeek-Prover-v1.5 and Goedel-Prover, achieving state-of-the-art performance in the field of whole-proof generation. For example, we achieve a 59.8% pass rate (pass@32) on MiniF2F. This is an on-going project and we will progressively update our findings, release our data and training details.

Leanabell-Prover: 形式推論におけるポストトレーニングスケーリング

Leanabell-Prover: Posttraining Scaling in Formal Reasoning

要旨

Support