Pythagoras-Prover：通過增強的Lean形式化推進高效形式化證明

摘要

现代Lean定理证明器仅在大量训练和推理计算资源下才能实现高性能，部分原因是经过验证的证明数据稀缺，以及形式化证明搜索过程中产生的长推理轨迹，这使得监督微调（SFT）和采样成本高昂。我们提出Pythagoras-Prover，一个面向实际计算预算的高效开源Lean定理证明器系列。该系列涵盖两代生成范式：参数规模为4B和32B的自回归模型，以及首个概念验证的基于扩散的证明器（4B），该证明器在推理时通过迭代优化生成Lean证明。在训练效率方面，我们构建了一个经过Lean验证的语料库，按难度分为简单、中等和困难问题，用于课程式监督微调，使模型能够从较短、较简单的证明逐步掌握较长、较困难的证明技能。在监督微调过程中，我们采用动态证明推理过滤方案，保留有信息量的证明轨迹，同时将每个实例的上下文预算限制在8k个token以内。我们还引入了增强式Lean形式化（ALF），该方法将稀缺的经过验证的语料库扩展为形式化陈述的变体，通过自蒸馏生成额外训练信号，而无需对每个突变实例进行形式化验证。通过扰动已知问题同时保留其形式化特征，ALF减少了对任何陈述表面形式的依赖。实验结果表明，在MiniF2F-Test测试集上，Pythagoras-Prover-4B在pass@32指标上以86.1%的成绩超越了DeepSeek-Prover-V2-671B的82.4%，且参数规模减少约167倍；而Pythagoras-Prover-32B在MiniF2F-Test上以93.0%的成绩创下开源方法的最优水平，并在PutnamBench的672个问题中成功解决93题。我们发布了MiniF2F-ALF，一个经ALF突变处理的、对数据污染敏感的基准测试集，所有经过评估的模型在该测试集上的准确率均出现下降；在此基准上，我们的32B模型仍保持最强性能，而4B模型则与先前最优的开源方法Goedel-Prover-V2-32B持平。

English

Modern Lean theorem provers achieve strong performance only with substantial training and inference compute, driven in part by scarce verified proof data and the long reasoning traces of formal proof search, making both supervised fine-tuning (SFT) and sampling expensive. We introduce Pythagoras-Prover, a compute-efficient open-source family of Lean theorem provers built for practical compute budgets. The family spans two generation paradigms: autoregressive models at 4B and 32B parameters, and a first proof-of-concept diffusion-based prover (4B) that iteratively refines Lean proofs at inference time. For training efficiency, we build a Lean-verified corpus stratified into easy, medium, and hard problems for curriculum SFT, so models acquire proof skills progressively from shorter, simpler proofs to longer, harder ones. During SFT, a dynamic proof-reasoning filtering scheme preserves informative proof traces while keeping each instance within an 8k-token context budget. We also introduce Augmented Lean Formalisation (ALF), which expands scarce verified corpora into variants of formal statements, populated via self-distillation for extra training signal without formally verifying every mutated instance. By perturbing known problems while preserving their formal character, ALF reduces reliance on any statement's surface form. Empirically, Pythagoras-Prover-4B surpasses DeepSeek-Prover-V2-671B at pass@32 on MiniF2F-Test (86.1% vs 82.4%) with ~167x fewer parameters, while Pythagoras-Prover-32B sets the open-source state of the art at 93.0% on MiniF2F-Test and solves 93 of 672 PutnamBench problems. We release MiniF2F-ALF, an ALF-mutated contamination-sensitive benchmark on which every evaluated model loses accuracy; here our 32B remains strongest and our 4B matches the prior state of the art, Goedel-Prover-V2-32B.