LeanDojo: 검색 강화 언어 모델을 활용한 정리 증명

초록

대규모 언어 모델(LLMs)은 Lean과 같은 증명 보조 도구를 사용하여 형식적 정리를 증명하는 데 유망한 가능성을 보여주었다. 그러나 기존 방법들은 비공개 코드, 데이터, 그리고 대규모 컴퓨팅 자원 요구 사항으로 인해 재현하거나 확장하기 어려운 문제가 있었다. 이는 정리 증명을 위한 기계 학습 방법 연구에 상당한 장벽을 만들어냈다. 본 논문은 이러한 장벽을 제거하기 위해 LeanDojo를 소개한다: LeanDojo는 툴킷, 데이터, 모델, 벤치마크로 구성된 오픈소스 Lean 플레이그라운드이다. LeanDojo는 Lean에서 데이터를 추출하고 프로그램적으로 증명 환경과 상호작용할 수 있도록 한다. 이는 증명 내 전제(premise)에 대한 세밀한 주석을 포함하여, 정리 증명의 주요 병목 현상인 전제 선택(premise selection)을 위한 귀중한 데이터를 제공한다. 이 데이터를 활용하여, 우리는 ReProver(Retrieval-Augmented Prover)를 개발했다: ReProver는 방대한 수학 라이브러리에서 전제를 선택하기 위해 검색(retrieval)을 강화한 최초의 LLM 기반 증명기이다. 이는 비용이 저렴하며 단일 GPU로 일주일 동안의 훈련만으로도 충분하다. 우리의 검색기는 LeanDojo의 프로그램 분석 기능을 활용하여 접근 가능한 전제와 어려운 부정 예제(hard negative examples)를 식별함으로써 검색을 훨씬 더 효과적으로 만든다. 또한, 우리는 Lean의 수학 라이브러리에서 추출한 96,962개의 정리와 증명으로 구성된 새로운 벤치마크를 구축했다. 이 벤치마크는 훈련 중에 사용되지 않은 새로운 전제에 의존하는 정리로 일반화해야 하는 도전적인 데이터 분할을 특징으로 한다. 우리는 이 벤치마크를 훈련과 평가에 사용하며, 실험 결과는 ReProver가 검색을 사용하지 않는 베이스라인과 GPT-4보다 우수한 성능을 보임을 입증한다. 따라서 우리는 독점 데이터셋 없이 오픈소스 LLM 기반 정리 증명기를 최초로 제공하고, 이를 허가적인 MIT 라이선스 하에 공개하여 추가 연구를 촉진한다.

English

Large language models (LLMs) have shown promise in proving formal theorems using proof assistants such as Lean. However, existing methods are difficult to reproduce or build on, due to private code, data, and large compute requirements. This has created substantial barriers to research on machine learning methods for theorem proving. This paper removes these barriers by introducing LeanDojo: an open-source Lean playground consisting of toolkits, data, models, and benchmarks. LeanDojo extracts data from Lean and enables interaction with the proof environment programmatically. It contains fine-grained annotations of premises in proofs, providing valuable data for premise selection: a key bottleneck in theorem proving. Using this data, we develop ReProver (Retrieval-Augmented Prover): the first LLM-based prover that is augmented with retrieval for selecting premises from a vast math library. It is inexpensive and needs only one GPU week of training. Our retriever leverages LeanDojo's program analysis capability to identify accessible premises and hard negative examples, which makes retrieval much more effective. Furthermore, we construct a new benchmark consisting of 96,962 theorems and proofs extracted from Lean's math library. It features challenging data split requiring the prover to generalize to theorems relying on novel premises that are never used in training. We use this benchmark for training and evaluation, and experimental results demonstrate the effectiveness of ReProver over non-retrieval baselines and GPT-4. We thus provide the first set of open-source LLM-based theorem provers without any proprietary datasets and release it under a permissive MIT license to facilitate further research.

LeanDojo: 검색 강화 언어 모델을 활용한 정리 증명

LeanDojo: Theorem Proving with Retrieval-Augmented Language Models

초록

Support