ChatPaper.aiChatPaper

Mathesis:迈向从自然语言进行形式化定理证明

Mathesis: Towards Formal Theorem Proving from Natural Languages

June 8, 2025
作者: Yu Xuejun, Jianyuan Zhong, Zijin Feng, Pengyi Zhai, Roozbeh Yousefzadeh, Wei Chong Ng, Haoxiong Liu, Ziyi Shou, Jing Xiong, Yudong Zhou, Claudia Beth Ong, Austen Jeremy Sugiarto, Yaoxi Zhang, Wai Ming Tai, Huan Cao, Dongcai Lu, Jiacheng Sun, Qiang Xu, Shen Xin, Zhenguo Li
cs.AI

摘要

近期大型语言模型的进展在形式推理领域展现出巨大潜力。然而,大多数基于LLM的定理证明器长期受限于需要专家编写的正式陈述作为输入,限制了其在自然语言表达的现实问题中的应用。我们通过Mathesis填补了这一空白,这是首个处理非正式问题陈述的端到端定理证明流程。它贡献了Mathesis-Autoformalizer,这是首个利用强化学习增强自然语言问题形式化能力的自动形式化器,并辅以我们新颖的LeanScorer框架进行细致的形式化质量评估。同时,Mathesis还提出了Mathesis-Prover,用于从形式化陈述生成正式证明。为了评估端到端形式定理证明在现实世界中的适用性,我们引入了Gaokao-Formal,这是一个包含488道中国高考复杂题目的基准测试集。我们的方法经过精心设计,对每个组件进行了深入研究。实验证明了Mathesis的有效性,其自动形式化器在Gaokao-Formal上的通过率比最佳基线高出22%。完整系统超越了其他模型组合,在MiniF2F上以pass@32达到64%的准确率,并在Gaokao-Formal上取得了18%的顶尖水平。
English
Recent advances in large language models show strong promise for formal reasoning. However, most LLM-based theorem provers have long been constrained by the need for expert-written formal statements as inputs, limiting their applicability to real-world problems expressed in natural language. We tackle this gap with Mathesis, the first end-to-end theorem proving pipeline processing informal problem statements. It contributes Mathesis-Autoformalizer, the first autoformalizer using reinforcement learning to enhance the formalization ability of natural language problems, aided by our novel LeanScorer framework for nuanced formalization quality assessment. It also proposes a Mathesis-Prover, which generates formal proofs from the formalized statements. To evaluate the real-world applicability of end-to-end formal theorem proving, we introduce Gaokao-Formal, a benchmark of 488 complex problems from China's national college entrance exam. Our approach is carefully designed, with a thorough study of each component. Experiments demonstrate Mathesis's effectiveness, with the autoformalizer outperforming the best baseline by 22% in pass-rate on Gaokao-Formal. The full system surpasses other model combinations, achieving 64% accuracy on MiniF2F with pass@32 and a state-of-the-art 18% on Gaokao-Formal.
PDF22June 11, 2025