

TheoremLlama: Transforming General-Purpose LLMs into Lean4 Experts

July 3, 2024
Authors: Ruida Wang, Jipeng Zhang, Yizhen Jia, Rui Pan, Shizhe Diao, Renjie Pi, Tong Zhang
cs.AI

Abstract

Proving mathematical theorems using computer-verifiable formal languages like Lean significantly impacts mathematical reasoning. One approach to formal theorem proving involves generating complete proofs using Large Language Models (LLMs) based on Natural Language (NL) proofs. Similar methods have shown promising results in code generation. However, most modern LLMs exhibit suboptimal performance due to the scarcity of aligned NL and Formal Language (FL) theorem-proving data. This scarcity results in a paucity of methodologies for training LLMs and techniques to fully utilize their capabilities in composing formal proofs. To address the challenges, this paper proposes **TheoremLlama**, an end-to-end framework to train a general-purpose LLM to become a Lean4 expert. This framework encompasses NL-FL aligned dataset generation methods, training approaches for the LLM formal theorem prover, and techniques for LLM Lean4 proof writing. Using the dataset generation method, we provide *Open Bootstrapped Theorems* (OBT), an NL-FL aligned and bootstrapped dataset. A key innovation in this framework is the NL-FL bootstrapping method, where NL proofs are integrated into Lean4 code for training datasets, leveraging the NL reasoning ability of LLMs for formal reasoning. The **TheoremLlama** framework achieves cumulative accuracies of 36.48% and 33.61% on MiniF2F-Valid and Test datasets respectively, surpassing the GPT-4 baseline of 22.95% and 25.41%. We have also open-sourced our model checkpoints and generated dataset, and will soon make all the code publicly available.
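
To make the NL-FL bootstrapping idea concrete, here is a minimal hand-written Lean4 sketch (an illustration, not an example taken from the OBT dataset): the natural-language proof is interleaved with the formal tactic steps as comments, which is the kind of pairing the framework uses to carry the LLM's NL reasoning into formal proof writing. The theorem and lemma names below are standard Mathlib ones; the example itself is hypothetical.

```lean
import Mathlib.Tactic

-- Hypothetical NL-FL "bootstrapped" training example: the informal
-- proof is embedded as comments alongside the Lean4 tactic steps.
theorem sum_sq_nonneg (a b : ℝ) : 0 ≤ a ^ 2 + b ^ 2 := by
  -- NL: the square of any real number is non-negative,
  -- so both a^2 and b^2 are non-negative ...
  have ha : 0 ≤ a ^ 2 := sq_nonneg a
  have hb : 0 ≤ b ^ 2 := sq_nonneg b
  -- NL: ... and the sum of two non-negative numbers is non-negative.
  exact add_nonneg ha hb
```

Training on such interleaved records lets the model treat the comments as a natural-language scaffold for the formal steps that follow them.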

