

EvolProver: Advancing Automated Theorem Proving by Evolving Formalized Problems via Symmetry and Difficulty

October 1, 2025
作者: Yuchen Tian, Ruiyuan Huang, Xuanwu Wang, Jing Ma, Zengfeng Huang, Ziyang Luo, Hongzhan Lin, Da Zheng, Lun Du
cs.AI

Abstract

Large Language Models (LLMs) for formal theorem proving have shown significant promise, yet they often lack generalizability and are fragile to even minor transformations of problem statements. To address this limitation, we introduce a novel data augmentation pipeline designed to enhance model robustness from two perspectives: symmetry and difficulty. From the symmetry perspective, we propose two complementary methods: EvolAST, an Abstract Syntax Tree (AST) based approach that targets syntactic symmetry to generate semantically equivalent problem variants, and EvolDomain, which leverages LLMs to address semantic symmetry by translating theorems across mathematical domains. From the difficulty perspective, we propose EvolDifficulty, which uses carefully designed evolutionary instructions to guide LLMs in generating new theorems with a wider range of difficulty. We then use the evolved data to train EvolProver, a 7B-parameter non-reasoning theorem prover. EvolProver establishes a new state-of-the-art (SOTA) on FormalMATH-Lite with a 53.8% pass@32 rate, surpassing all models of comparable size, including reasoning-based models. It also sets new SOTA records for non-reasoning models on MiniF2F-Test (69.8% pass@32), Ineq-Comp-Seed (52.2% pass@32), and Ineq-Comp-Transformed (34.0% pass@32). Ablation studies further confirm our data augmentation pipeline's effectiveness across multiple benchmarks.
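The syntactic-symmetry idea behind EvolAST can be illustrated with a minimal sketch (not the paper's actual pipeline, which operates on formalized theorem statements): parse an expression into an AST and apply meaning-preserving rewrites, such as swapping the operands of commutative operators, to obtain a syntactically different but semantically equivalent variant.

```python
# Illustrative sketch of AST-based symmetric variant generation.
# Assumption: we use Python's own `ast` module on arithmetic expressions
# as a stand-in for transformations on formal theorem statements.
import ast

class CommuteSwap(ast.NodeTransformer):
    """Swap operands of commutative operators (+ and *), producing a
    semantically equivalent but syntactically distinct expression."""
    def visit_BinOp(self, node):
        self.generic_visit(node)  # rewrite subtrees first
        if isinstance(node.op, (ast.Add, ast.Mult)):
            node.left, node.right = node.right, node.left
        return node

def variant(expr: str) -> str:
    tree = ast.parse(expr, mode="eval")
    tree = CommuteSwap().visit(tree)
    return ast.unparse(tree)  # requires Python 3.9+

print(variant("a + b * c"))  # → "c * b + a"
```

A prover that is robust in the paper's sense should succeed on both `a + b * c` and `c * b + a`, since they denote the same value; EvolAST trains toward exactly this invariance.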
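The pass@32 figures above follow the standard unbiased pass@k estimator (Chen et al., 2021): draw n proof attempts per problem, count the c that verify, and estimate the probability that at least one of k sampled attempts succeeds. A minimal sketch, with illustrative numbers:

```python
# Unbiased pass@k estimator: pass@k = 1 - C(n-c, k) / C(n, k),
# where n = samples drawn, c = samples that verify, k = budget.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # every size-k subset must contain a success
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 64 attempts, 2 verified proofs, budget of 32.
print(round(pass_at_k(n=64, c=2, k=32), 3))  # → 0.754
```

Benchmark-level pass@32 is then the mean of this quantity over all problems in the suite.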