Pythagoras-Prover：通过增强的Lean形式化推进高效形式化证明

摘要

现代Lean定理证明器只有通过大量的训练和推理计算才能实现强大性能，这在一定程度上源于已验证证明数据的稀缺性以及形式化证明搜索中的长推理轨迹，导致监督微调（SFT）和采样成本高昂。我们提出Pythagoras-Prover，这是一个面向实际计算预算、计算高效的开放源码Lean定理证明器系列。该系列涵盖两种生成范式：参数规模为4B和32B的自回归模型，以及首个基于扩散的证明器概念验证（4B），该模型在推理时通过迭代方式精炼Lean证明。在训练效率方面，我们构建了一个经Lean验证的语料库，将其分层为简单、中等和困难问题以用于课程式监督微调，使模型能够从较短、较简单的证明逐步掌握到较长、较困难的证明技能。在监督微调过程中，我们采用动态证明推理过滤方案，在保留信息量丰富的证明轨迹的同时，将每个实例控制在8k token的上下文预算内。我们还引入了增强型Lean形式化（ALF），它将稀缺的验证语料扩展为形式化语句的变体，通过自蒸馏方式提供额外的训练信号，而无需对每个变异后的实例进行形式化验证。通过扰动已知问题同时保留其形式化特征，ALF减少了对任何语句表面形式的依赖。实验结果表明，Pythagoras-Prover-4B在MiniF2F-Test上的pass@32指标（86.1%对阵82.4%）超越了DeepSeek-Prover-V2-671B，而参数规模仅为后者的约1/167；Pythagoras-Prover-32B在MiniF2F-Test上达到93.0%，创下了开放源码的最优水平，并在PutnamBench的672个问题中解决了93个。我们发布了MiniF2F-ALF，这是一个经ALF变异处理的污染敏感基准测试集，所有经过评估的模型在此基准上的准确率均有下降；在此基准上，我们的32B模型仍然是最强的，而4B模型则与先前的最优模型Goedel-Prover-V2-32B性能相当。

English

Modern Lean theorem provers achieve strong performance only with substantial training and inference compute, driven in part by scarce verified proof data and the long reasoning traces of formal proof search, making both supervised fine-tuning (SFT) and sampling expensive. We introduce Pythagoras-Prover, a compute-efficient open-source family of Lean theorem provers built for practical compute budgets. The family spans two generation paradigms: autoregressive models at 4B and 32B parameters, and a first proof-of-concept diffusion-based prover (4B) that iteratively refines Lean proofs at inference time. For training efficiency, we build a Lean-verified corpus stratified into easy, medium, and hard problems for curriculum SFT, so models acquire proof skills progressively from shorter, simpler proofs to longer, harder ones. During SFT, a dynamic proof-reasoning filtering scheme preserves informative proof traces while keeping each instance within an 8k-token context budget. We also introduce Augmented Lean Formalisation (ALF), which expands scarce verified corpora into variants of formal statements, populated via self-distillation for extra training signal without formally verifying every mutated instance. By perturbing known problems while preserving their formal character, ALF reduces reliance on any statement's surface form. Empirically, Pythagoras-Prover-4B surpasses DeepSeek-Prover-V2-671B at pass@32 on MiniF2F-Test (86.1% vs 82.4%) with ~167x fewer parameters, while Pythagoras-Prover-32B sets the open-source state of the art at 93.0% on MiniF2F-Test and solves 93 of 672 PutnamBench problems. We release MiniF2F-ALF, an ALF-mutated contamination-sensitive benchmark on which every evaluated model loses accuracy; here our 32B remains strongest and our 4B matches the prior state of the art, Goedel-Prover-V2-32B.