LeanDojo: Theorem Proving with Retrieval-Augmented Language Models
June 27, 2023
Authors: Kaiyu Yang, Aidan M. Swope, Alex Gu, Rahul Chalamala, Peiyang Song, Shixing Yu, Saad Godil, Ryan Prenger, Anima Anandkumar
cs.AI
Abstract
Large language models (LLMs) have shown promise in proving formal theorems
using proof assistants such as Lean. However, existing methods are difficult to
reproduce or build on, due to private code, data, and large compute
requirements. This has created substantial barriers to research on machine
learning methods for theorem proving. This paper removes these barriers by
introducing LeanDojo: an open-source Lean playground consisting of toolkits,
data, models, and benchmarks. LeanDojo extracts data from Lean and enables
interaction with the proof environment programmatically. It contains
fine-grained annotations of premises in proofs, providing valuable data for
premise selection: a key bottleneck in theorem proving. Using this data, we
develop ReProver (Retrieval-Augmented Prover): the first LLM-based prover that
is augmented with retrieval for selecting premises from a vast math library. It
is inexpensive and needs only one GPU week of training. Our retriever leverages
LeanDojo's program analysis capability to identify accessible premises and hard
negative examples, which makes retrieval much more effective. Furthermore, we
construct a new benchmark consisting of 96,962 theorems and proofs extracted
from Lean's math library. It features a challenging data split requiring the
prover to generalize to theorems relying on novel premises that are never used
in training. We use this benchmark for training and evaluation, and
experimental results demonstrate the effectiveness of ReProver over
non-retrieval baselines and GPT-4. We thus provide the first set of open-source
LLM-based theorem provers without any proprietary datasets and release it under
a permissive MIT license to facilitate further research.
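
To make the programmatic interaction concrete, the sketch below drives a proof step by step through LeanDojo's documented Python API (`LeanGitRepo`, `Theorem`, `Dojo`, `run_tac`). The repository commit, file path, theorem name, and tactic are placeholders rather than values from the paper.

```python
from lean_dojo import LeanGitRepo, Theorem, Dojo, ProofFinished

# Pin a Lean repository at a specific commit (URL and commit are placeholders).
repo = LeanGitRepo(
    "https://github.com/leanprover-community/mathlib",
    "19c869efa56bbb8b500f2724c0b77261edbfa28c",  # placeholder commit hash
)

# A theorem is identified by its source file and fully qualified name
# (both placeholders here).
theorem = Theorem(repo, "src/data/nat/gcd.lean", "nat.gcd_self")

# Enter the proof environment, run a tactic, and inspect the outcome.
with Dojo(theorem) as (dojo, init_state):
    print(init_state)                                    # initial proof state
    result = dojo.run_tac(init_state, "simp [nat.gcd]")  # placeholder tactic
    if isinstance(result, ProofFinished):
        print("Proof complete.")
```

Each `run_tac` call returns either a new tactic state or a terminal outcome such as `ProofFinished`, which is what lets a search procedure drive the prover one tactic at a time.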
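
ReProver's premise selection follows the dense-retrieval pattern: embed the current proof state and every accessible premise with a shared encoder, then rank premises by cosine similarity. Below is an illustrative reconstruction under stated assumptions, not the released ReProver code: it substitutes the generic `google/byt5-small` encoder for a trained retriever, and the proof state and premise strings are invented.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, T5EncoderModel

# Generic ByT5 encoder standing in for a trained retriever (assumption).
tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
encoder = T5EncoderModel.from_pretrained("google/byt5-small").eval()

@torch.no_grad()
def embed(texts):
    """Mean-pool token embeddings into one unit-length vector per text."""
    batch = tokenizer(texts, return_tensors="pt", padding=True,
                      truncation=True, max_length=512)
    hidden = encoder(**batch).last_hidden_state            # (B, L, D)
    mask = batch["attention_mask"].unsqueeze(-1).float()   # (B, L, 1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # masked mean over L
    return F.normalize(pooled, dim=-1)

# Toy proof state and candidate premises (invented strings).
state = "n : ℕ ⊢ nat.gcd n n = n"
premises = [
    "theorem nat.gcd_self (n : ℕ) : nat.gcd n n = n",
    "theorem nat.add_comm (a b : ℕ) : a + b = b + a",
    "theorem nat.mul_one (a : ℕ) : a * 1 = a",
]

# Cosine similarity reduces to a dot product of unit vectors.
scores = embed([state]) @ embed(premises).T  # shape (1, num_premises)
best = scores.argmax(dim=-1).item()
print(premises[best])
```

During training, the ground-truth premises are contrasted against easy in-batch negatives and the hard negatives surfaced by LeanDojo's program analysis; at inference time, only premises accessible at the current state are ranked, and the top-ranked ones are passed to the tactic generator.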
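
The novel-premises split constraint can also be stated procedurally: every test theorem's proof must use at least one premise that appears in no training proof. The sketch below enforces that invariant by fixed-point filtering; the data layout (a "premises" field per theorem) and the construction itself are illustrative assumptions, not the benchmark's actual build script.

```python
import random

def novel_premises_split(theorems, n_test, seed=0):
    """Split `theorems` so each test theorem uses >= 1 premise unseen in training.

    Each theorem is a dict with a "premises" field listing the premises
    used in its proof (an assumed schema, for illustration only).
    """
    order = theorems[:]
    random.Random(seed).shuffle(order)
    test_pool, train = order[:n_test], order[n_test:]

    # Demote test theorems whose premises are all covered by training,
    # repeating until the "novel premise" invariant holds for the rest.
    while True:
        train_premises = {p for t in train for p in t["premises"]}
        keep = [t for t in test_pool if set(t["premises"]) - train_premises]
        demoted = [t for t in test_pool
                   if not set(t["premises"]) - train_premises]
        if not demoted:
            return train, keep
        train += demoted
        test_pool = keep
```

Because demoting a theorem enlarges the training premise set, the filter repeats until no test theorem remains whose premises are all seen in training; the theorems that survive satisfy the split's generalization requirement by construction.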