ChatPaper.aiChatPaper

哥德爾證明者:開源自動定理證明的前沿模型

Goedel-Prover: A Frontier Model for Open-Source Automated Theorem Proving

February 11, 2025
作者: Yong Lin, Shange Tang, Bohan Lyu, Jiayun Wu, Hongzhou Lin, Kaiyu Yang, Jia Li, Mengzhou Xia, Danqi Chen, Sanjeev Arora, Chi Jin
cs.AI

摘要

我們介紹 Goedel-Prover,一個開源的大型語言模型(LLM),在數學問題的自動形式證明生成方面實現了最先進的性能。這個領域的主要挑戰在於正式化數學陳述和證明的稀缺性,我們通過以下方式應對。我們訓練陳述正式化器將 Numina 的自然語言數學問題轉換為正式語言(Lean 4),創建了一個包含 1.64 百萬個正式陳述的數據集。LLM 用於檢查正式陳述是否準確地保留了原始自然語言問題的內容。然後,我們通過訓練一系列證明器來迭代地構建一個大型的正式證明數據集。每個證明器成功地證明了許多之前的證明器無法證明的陳述,這些新的證明被添加到下一個證明器的訓練集中。最終的證明器在整個證明生成方面優於所有現有的開源模型。在 miniF2F 基準測試中,它實現了 57.6% 的成功率(Pass@32),超過之前最好的開源模型 7.6%。在 PutnamBench 上,Goedel-Prover 成功解決了 7 個問題(Pass@512),在排行榜上名列第一。此外,它為 Lean Workbook 問題生成了 29.7K 個正式證明,幾乎是之前作品產生的 15.7K 的兩倍。
English
We introduce Goedel-Prover, an open-source large language model (LLM) that achieves the state-of-the-art (SOTA) performance in automated formal proof generation for mathematical problems. The key challenge in this field is the scarcity of formalized math statements and proofs, which we tackle in the following ways. We train statement formalizers to translate the natural language math problems from Numina into formal language (Lean 4), creating a dataset of 1.64 million formal statements. LLMs are used to check that the formal statements accurately preserve the content of the original natural language problems. We then iteratively build a large dataset of formal proofs by training a series of provers. Each prover succeeds in proving many statements that the previous ones could not, and these new proofs are added to the training set for the next prover. The final prover outperforms all existing open-source models in whole-proof generation. On the miniF2F benchmark, it achieves a 57.6% success rate (Pass@32), exceeding the previous best open-source model by 7.6%. On PutnamBench, Goedel-Prover successfully solves 7 problems (Pass@512), ranking first on the leaderboard. Furthermore, it generates 29.7K formal proofs for Lean Workbook problems, nearly doubling the 15.7K produced by earlier works.

Summary

AI-Generated Summary

PDF82February 12, 2025