TheoremLlama: 범용 LLM을 Lean4 전문가로 변환하기

초록

Lean과 같은 컴퓨터 검증 가능한 형식 언어를 사용하여 수학 정리를 증명하는 것은 수학적 추론에 상당한 영향을 미칩니다. 형식적 정리 증명에 대한 한 가지 접근 방식은 자연어(NL) 증명을 기반으로 대규모 언어 모델(LLM)을 사용하여 완전한 증명을 생성하는 것입니다. 유사한 방법들은 코드 생성에서 유망한 결과를 보여주었습니다. 그러나 대부분의 현대 LLM은 정렬된 NL과 형식 언어(FL) 정리 증명 데이터의 부족으로 인해 최적의 성능을 보이지 못하고 있습니다. 이러한 부족은 LLM을 훈련시키는 방법론과 형식적 증명 작성을 위해 그들의 능력을 완전히 활용하는 기술의 부족으로 이어집니다. 이러한 문제를 해결하기 위해, 본 논문은 **TheoremLlama**라는 종단 간(end-to-end) 프레임워크를 제안합니다. 이 프레임워크는 NL-FL 정렬 데이터셋 생성 방법, LLM 형식 정리 증명기 훈련 접근법, 그리고 LLM Lean4 증명 작성 기술을 포함합니다. 데이터셋 생성 방법을 사용하여, 우리는 NL-FL 정렬 및 부트스트랩된 데이터셋인 *Open Bootstrapped Theorems* (OBT)를 제공합니다. 이 프레임워크의 주요 혁신은 NL 증명이 Lean4 코드에 통합되어 훈련 데이터셋을 생성하는 NL-FL 부트스트래핑 방법으로, LLM의 NL 추론 능력을 형식적 추론에 활용합니다. **TheoremLlama** 프레임워크는 MiniF2F-Valid 및 Test 데이터셋에서 각각 36.48%와 33.61%의 누적 정확도를 달성하여 GPT-4 기준선인 22.95%와 25.41%를 능가합니다. 또한, 우리는 모델 체크포인트와 생성된 데이터셋을 오픈소스로 공개했으며, 곧 모든 코드를 공개할 예정입니다.

English

Proving mathematical theorems using computer-verifiable formal languages like Lean significantly impacts mathematical reasoning. One approach to formal theorem proving involves generating complete proofs using Large Language Models (LLMs) based on Natural Language (NL) proofs. Similar methods have shown promising results in code generation. However, most modern LLMs exhibit suboptimal performance due to the scarcity of aligned NL and Formal Language (FL) theorem-proving data. This scarcity results in a paucity of methodologies for training LLMs and techniques to fully utilize their capabilities in composing formal proofs. To address the challenges, this paper proposes **TheoremLlama**, an end-to-end framework to train a general-purpose LLM to become a Lean4 expert. This framework encompasses NL-FL aligned dataset generation methods, training approaches for the LLM formal theorem prover, and techniques for LLM Lean4 proof writing. Using the dataset generation method, we provide *Open Bootstrapped Theorems* (OBT), an NL-FL aligned and bootstrapped dataset. A key innovation in this framework is the NL-FL bootstrapping method, where NL proofs are integrated into Lean4 code for training datasets, leveraging the NL reasoning ability of LLMs for formal reasoning. The **TheoremLlama** framework achieves cumulative accuracies of 36.48% and 33.61% on MiniF2F-Valid and Test datasets respectively, surpassing the GPT-4 baseline of 22.95% and 25.41%. We have also open-sourced our model checkpoints and generated dataset, and will soon make all the code publicly available.

TheoremLlama: 범용 LLM을 Lean4 전문가로 변환하기

TheoremLlama: Transforming General-Purpose LLMs into Lean4 Experts

초록

Support