FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models
May 5, 2025
Authors: Zhouliang Yu, Ruotian Peng, Keyi Ding, Yizhe Li, Zhongyuan Peng, Minghao Liu, Yifan Zhang, Zheng Yuan, Huajian Xin, Wenhao Huang, Yandong Wen, Ge Zhang, Weiyang Liu
cs.AI
Abstract
Formal mathematical reasoning remains a critical challenge for artificial intelligence, hindered by limitations of existing benchmarks in scope and scale. To address this, we present FormalMATH, a large-scale Lean4 benchmark comprising 5,560 formally verified problems spanning from high-school Olympiad challenges to undergraduate-level theorems across diverse domains (e.g., algebra, applied mathematics, calculus, number theory, and discrete mathematics). To mitigate the inefficiency of manual formalization, we introduce a novel human-in-the-loop autoformalization pipeline that integrates: (1) specialized large language models (LLMs) for statement autoformalization, (2) multi-LLM semantic verification, and (3) negation-based disproof filtering strategies using off-the-shelf LLM-based provers. This approach reduces expert annotation costs by retaining 72.09% of statements before manual verification while ensuring fidelity to the original natural-language problems. Our evaluation of state-of-the-art LLM-based theorem provers reveals significant limitations: even the strongest models achieve only a 16.46% success rate under practical sampling budgets, exhibiting pronounced domain bias (e.g., excelling in algebra but failing in calculus) and over-reliance on simplified automation tactics. Notably, we identify a counterintuitive inverse relationship between natural-language solution guidance and proof success in chain-of-thought reasoning scenarios, suggesting that human-written informal reasoning introduces noise rather than clarity in formal reasoning settings. We believe that FormalMATH provides a robust platform for benchmarking formal mathematical reasoning.