
FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models

May 5, 2025
Authors: Zhouliang Yu, Ruotian Peng, Keyi Ding, Yizhe Li, Zhongyuan Peng, Minghao Liu, Yifan Zhang, Zheng Yuan, Huajian Xin, Wenhao Huang, Yandong Wen, Ge Zhang, Weiyang Liu
cs.AI

Abstract

Formal mathematical reasoning remains a critical challenge for artificial intelligence, hindered by the limited scope and scale of existing benchmarks. To address this, we present FormalMATH, a large-scale Lean4 benchmark comprising 5,560 formally verified problems spanning from high-school Olympiad challenges to undergraduate-level theorems across diverse domains (e.g., algebra, applied mathematics, calculus, number theory, and discrete mathematics). To mitigate the inefficiency of manual formalization, we introduce a novel human-in-the-loop autoformalization pipeline that integrates: (1) specialized large language models (LLMs) for statement autoformalization, (2) multi-LLM semantic verification, and (3) negation-based disproof filtering strategies using off-the-shelf LLM-based provers. This approach reduces expert annotation costs by retaining 72.09% of statements before manual verification while ensuring fidelity to the original natural-language problems. Our evaluation of state-of-the-art LLM-based theorem provers reveals significant limitations: even the strongest models achieve only a 16.46% success rate under practical sampling budgets, exhibiting pronounced domain bias (e.g., excelling in algebra but failing in calculus) and over-reliance on simplified automation tactics. Notably, we identify a counterintuitive inverse relationship between natural-language solution guidance and proof success in chain-of-thought reasoning scenarios, suggesting that human-written informal reasoning introduces noise rather than clarity in formal reasoning settings. We believe that FormalMATH provides a robust foundation for benchmarking formal mathematical reasoning.
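For concreteness, a benchmark entry of the kind described above is a Lean 4 theorem statement that a prover must close with a machine-checkable proof. The following is a minimal sketch in that style, not an actual FormalMATH problem, and it assumes a standard Mathlib setup:

    import Mathlib

    -- Olympiad-style fact stated as a Lean 4 theorem:
    -- the sum of two real squares is nonnegative.
    example (a b : ℝ) : 0 ≤ a ^ 2 + b ^ 2 :=
      add_nonneg (sq_nonneg a) (sq_nonneg b)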
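The negation-based disproof filter in step (3) of the pipeline can be pictured roughly as follows. This is a hypothetical sketch rather than the authors' implementation: negate, try_prove, and the string handling stand in for the actual Lean-level negation and the calls to off-the-shelf LLM-based provers.

    from typing import Callable, List

    def negate(statement: str) -> str:
        # Placeholder: wrap a formal statement in a logical negation.
        return f"¬ ({statement})"

    def disproof_filter(statements: List[str],
                        try_prove: Callable[[str], bool]) -> List[str]:
        # Discard any autoformalized statement whose negation the prover can
        # establish: such a statement is provably false and therefore cannot
        # be a faithful formalization of the original problem.
        kept = []
        for stmt in statements:
            if try_prove(negate(stmt)):
                continue
            kept.append(stmt)
        return kept

In the pipeline described in the abstract, statements that survive this filter and the multi-LLM semantic check are the ones retained before manual expert verification.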
