Mathesis:邁向從自然語言進行形式化定理證明
Mathesis: Towards Formal Theorem Proving from Natural Languages
June 8, 2025
作者: Yu Xuejun, Jianyuan Zhong, Zijin Feng, Pengyi Zhai, Roozbeh Yousefzadeh, Wei Chong Ng, Haoxiong Liu, Ziyi Shou, Jing Xiong, Yudong Zhou, Claudia Beth Ong, Austen Jeremy Sugiarto, Yaoxi Zhang, Wai Ming Tai, Huan Cao, Dongcai Lu, Jiacheng Sun, Qiang Xu, Shen Xin, Zhenguo Li
cs.AI
摘要
近期大型語言模型的進展展現了在形式推理方面的巨大潛力。然而,大多數基於LLM的定理證明器長期以來受限於需要專家撰寫的形式化陳述作為輸入,這限制了它們在處理以自然語言表達的現實世界問題時的適用性。我們通過Mathesis來解決這一差距,這是第一個處理非正式問題陳述的端到端定理證明流程。它貢獻了Mathesis-Autoformalizer,這是首個利用強化學習來增強自然語言問題形式化能力的自動形式化工具,並輔以我們新穎的LeanScorer框架進行細緻的形式化質量評估。此外,它還提出了Mathesis-Prover,用於從形式化陳述生成形式化證明。為了評估端到端形式定理證明的實際應用性,我們引入了Gaokao-Formal,這是一個包含488道來自中國高考的複雜問題的基準測試集。我們的方法經過精心設計,對每個組件進行了深入研究。實驗證明了Mathesis的有效性,其中自動形式化工具在Gaokao-Formal上的通過率比最佳基線高出22%。整個系統超越了其他模型組合,在MiniF2F上以pass@32達到了64%的準確率,並在Gaokao-Formal上取得了18%的領先水平。
English
Recent advances in large language models show strong promise for formal
reasoning. However, most LLM-based theorem provers have long been constrained
by the need for expert-written formal statements as inputs, limiting their
applicability to real-world problems expressed in natural language. We tackle
this gap with Mathesis, the first end-to-end theorem proving pipeline
processing informal problem statements. It contributes Mathesis-Autoformalizer,
the first autoformalizer using reinforcement learning to enhance the
formalization ability of natural language problems, aided by our novel
LeanScorer framework for nuanced formalization quality assessment. It also
proposes a Mathesis-Prover, which generates formal proofs from the formalized
statements. To evaluate the real-world applicability of end-to-end formal
theorem proving, we introduce Gaokao-Formal, a benchmark of 488 complex
problems from China's national college entrance exam. Our approach is carefully
designed, with a thorough study of each component. Experiments demonstrate
Mathesis's effectiveness, with the autoformalizer outperforming the best
baseline by 22% in pass-rate on Gaokao-Formal. The full system surpasses other
model combinations, achieving 64% accuracy on MiniF2F with pass@32 and a
state-of-the-art 18% on Gaokao-Formal.