

TheoremLlama: Transforming General-Purpose LLMs into Lean4 Experts

July 3, 2024
Authors: Ruida Wang, Jipeng Zhang, Yizhen Jia, Rui Pan, Shizhe Diao, Renjie Pi, Tong Zhang
cs.AI

Abstract

Proving mathematical theorems using computer-verifiable formal languages like Lean significantly impacts mathematical reasoning. One approach to formal theorem proving involves generating complete proofs with Large Language Models (LLMs) based on Natural Language (NL) proofs. Similar methods have shown promising results in code generation. However, most modern LLMs exhibit suboptimal performance due to the scarcity of aligned NL and Formal Language (FL) theorem-proving data. This scarcity results in a paucity of methodologies for training LLMs and of techniques to fully utilize their capabilities in composing formal proofs. To address these challenges, this paper proposes **TheoremLlama**, an end-to-end framework that trains a general-purpose LLM to become a Lean4 expert. The framework encompasses NL-FL aligned dataset generation methods, training approaches for the LLM formal theorem prover, and techniques for LLM Lean4 proof writing. Using the dataset generation method, we provide *Open Bootstrapped Theorems* (OBT), an NL-FL aligned and bootstrapped dataset. A key innovation in this framework is the NL-FL bootstrapping method, in which NL proofs are integrated into the Lean4 code of the training dataset, leveraging the NL reasoning ability of LLMs for formal reasoning. The **TheoremLlama** framework achieves cumulative accuracies of 36.48% and 33.61% on the MiniF2F-Valid and Test datasets, respectively, surpassing the GPT-4 baselines of 22.95% and 25.41%. We have also open-sourced our model checkpoints and the generated dataset, and will soon make all the code publicly available.
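
To make the NL-FL bootstrapping idea concrete, below is a minimal illustrative sketch of what a bootstrapped training example can look like: the informal proof is embedded as comments interleaved with the Lean4 tactics it corresponds to, so the model learns to align NL reasoning steps with formal proof steps. The theorem and comment wording here are hypothetical and not taken from the OBT dataset.

```lean
-- Illustrative NL-FL bootstrapped example (hypothetical; not from OBT):
-- the natural-language proof is interleaved as comments with Lean4 tactics.
theorem zero_add_example (n : Nat) : 0 + n = n := by
  -- NL: We proceed by induction on n.
  induction n with
  | zero =>
    -- NL: Base case. 0 + 0 reduces to 0 by the definition of addition.
    rfl
  | succ k ih =>
    -- NL: Inductive step. Rewrite 0 + (k + 1) as (0 + k) + 1,
    -- then the induction hypothesis 0 + k = k closes the goal.
    rw [Nat.add_succ, ih]
```

In this style of training example, the comments carry the chain-of-thought the paper attributes to NL proofs, while the tactic lines remain verifiable by the Lean4 compiler.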

