EvolProver: Advancing Automated Theorem Proving by Evolving Formalized Problems via Symmetry and Difficulty
October 1, 2025
Authors: Yuchen Tian, Ruiyuan Huang, Xuanwu Wang, Jing Ma, Zengfeng Huang, Ziyang Luo, Hongzhan Lin, Da Zheng, Lun Du
cs.AI
Abstract
Large Language Models (LLMs) for formal theorem proving have shown
significant promise, yet they often lack generalizability and are fragile to
even minor transformations of problem statements. To address this limitation,
we introduce a novel data augmentation pipeline designed to enhance model
robustness from two perspectives: symmetry and difficulty. From the symmetry
perspective, we propose two complementary methods: EvolAST, an Abstract Syntax
Tree (AST) based approach that targets syntactic symmetry to generate
semantically equivalent problem variants, and EvolDomain, which leverages LLMs
to address semantic symmetry by translating theorems across mathematical
domains. From the difficulty perspective, we propose EvolDifficulty, which uses
carefully designed evolutionary instructions to guide LLMs in generating new
theorems with a wider range of difficulty. We then use the evolved data to
train EvolProver, a 7B-parameter non-reasoning theorem prover. EvolProver
establishes a new state-of-the-art (SOTA) on FormalMATH-Lite with a 53.8%
pass@32 rate, surpassing all models of comparable size, including
reasoning-based models. It also sets new SOTA records for non-reasoning models
on MiniF2F-Test (69.8% pass@32), Ineq-Comp-Seed (52.2% pass@32), and
Ineq-Comp-Transformed (34.0% pass@32). Ablation studies further confirm our
data augmentation pipeline's effectiveness across multiple benchmarks.
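The core idea behind EvolAST — rewriting a problem's abstract syntax tree to produce a syntactically different but semantically equivalent statement — can be illustrated with a minimal sketch. This is not the authors' implementation (which operates on formalized theorem statements); it merely demonstrates the principle on Python arithmetic expressions using the standard `ast` module, exploiting the commutativity of `+`:

```python
import ast

class CommuteAdd(ast.NodeTransformer):
    """Swap the operands of every commutative '+' node, yielding a
    syntactically distinct but semantically equivalent expression."""
    def visit_BinOp(self, node):
        self.generic_visit(node)          # rewrite children first
        if isinstance(node.op, ast.Add):  # a + b  ->  b + a
            node.left, node.right = node.right, node.left
        return node

def evolve(expr: str) -> str:
    """Parse an expression, apply the symmetry rewrite, and unparse it."""
    tree = ast.parse(expr, mode="eval")
    tree = CommuteAdd().visit(tree)
    return ast.unparse(tree)

if __name__ == "__main__":
    print(evolve("(a + b) * c"))  # prints "(b + a) * c"
```

A real pipeline in this spirit would apply a library of such semantics-preserving rewrites (commutativity, associativity, De Morgan duals, etc.) to formal statements, then verify that each variant still type-checks before adding it to the training set.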