Pythagoras-Prover: 증강된 Lean 형식화를 통한 효율적 형식 증명의 진보

초록

현대의 Lean 정리 증명기는 검증된 증명 데이터의 부족과 형식 증명 탐색의 긴 추론 과정으로 인해, 상당한 학습 및 추론 컴퓨팅 자원이 있어야만 높은 성능을 달성하며, 이는 지도 미세 조정(SFT)과 샘플링 모두를 비용이 많이 드는 작업으로 만듭니다. 우리는 실용적인 컴퓨팅 예산에 맞춰 설계된 컴퓨팅 효율적인 오픈소스 Lean 정리 증명기 제품군인 Pythagoras-Prover를 소개합니다. 이 제품군은 두 가지 생성 패러다임을 포괄합니다: 4B 및 32B 매개변수의 자기회귀 모델과, 추론 시점에 Lean 증명을 반복적으로 개선하는 최초의 개념 증명 확산 기반 증명기(4B)입니다. 학습 효율성을 위해, 우리는 커리큘럼 SFT를 위한 쉬움, 중간, 어려움 문제로 계층화된 Lean 검증 말뭉치를 구축하여, 모델이 더 짧고 단순한 증명에서 더 길고 어려운 증명으로 점진적으로 증명 기술을 습득하도록 합니다. SFT 중에는 동적 증명 추론 필터링 기법을 통해 각 인스턴스를 8k 토큰 컨텍스트 예산 내로 유지하면서 정보성 있는 증명 추적을 보존합니다. 또한, 검증된 말뭉치를 형식 명제의 변형으로 확장하는 증강 Lean 형식화(ALF)를 도입합니다. ALF는 모든 변형 인스턴스를 형식 검증하지 않고 자가 증류를 통해 추가 학습 신호를 제공합니다. 알려진 문제를 변형하면서도 형식적 특성을 유지함으로써, ALF는 명제의 표면 형태에 대한 의존성을 줄입니다. 실험적으로, Pythagoras-Prover-4B는 MiniF2F-Test에서 pass@32 기준(86.1% 대 82.4%)으로 DeepSeek-Prover-V2-671B를 약 167배 적은 매개변수로 능가하며, Pythagoras-Prover-32B는 MiniF2F-Test에서 93.0%로 오픈소스 최고 성능을 달성하고 PutnamBench의 672개 문제 중 93개를 해결합니다. 우리는 MiniF2F-ALF라는 ALF 변형 기반 오염 민감 벤치마크를 공개하며, 이 벤치마크에서는 평가된 모든 모델의 정확도가 하락합니다. 여기서 우리의 32B 모델은 가장 강력한 성능을 유지하고, 4B 모델은 이전 최고 성능인 Goedel-Prover-V2-32B와 동등한 성능을 보입니다.

English

Modern Lean theorem provers achieve strong performance only with substantial training and inference compute, driven in part by scarce verified proof data and the long reasoning traces of formal proof search, making both supervised fine-tuning (SFT) and sampling expensive. We introduce Pythagoras-Prover, a compute-efficient open-source family of Lean theorem provers built for practical compute budgets. The family spans two generation paradigms: autoregressive models at 4B and 32B parameters, and a first proof-of-concept diffusion-based prover (4B) that iteratively refines Lean proofs at inference time. For training efficiency, we build a Lean-verified corpus stratified into easy, medium, and hard problems for curriculum SFT, so models acquire proof skills progressively from shorter, simpler proofs to longer, harder ones. During SFT, a dynamic proof-reasoning filtering scheme preserves informative proof traces while keeping each instance within an 8k-token context budget. We also introduce Augmented Lean Formalisation (ALF), which expands scarce verified corpora into variants of formal statements, populated via self-distillation for extra training signal without formally verifying every mutated instance. By perturbing known problems while preserving their formal character, ALF reduces reliance on any statement's surface form. Empirically, Pythagoras-Prover-4B surpasses DeepSeek-Prover-V2-671B at pass@32 on MiniF2F-Test (86.1% vs 82.4%) with ~167x fewer parameters, while Pythagoras-Prover-32B sets the open-source state of the art at 93.0% on MiniF2F-Test and solves 93 of 672 PutnamBench problems. We release MiniF2F-ALF, an ALF-mutated contamination-sensitive benchmark on which every evaluated model loses accuracy; here our 32B remains strongest and our 4B matches the prior state of the art, Goedel-Prover-V2-32B.