Leanabell-Prover: 형식적 추론에서의 사후 학습 스케일링

초록

최근 LLM(Large Language Model)을 통한 자동 정리 증명(ATP)의 발전은 Lean 4 코드를 활용한 형식적 추론의 잠재력을 부각시켰습니다. 그러나 ATP는 OpenAI O1/O3와 Deepseek R1에서 입증된 최근의 사후 학습 스케일링(posttraining scaling)에 의해 혁신되지는 못했습니다. 본 연구에서는 자연어 추론 모델에서의 획기적인 발전과 ATP를 정렬시키기 위해 ATP의 전체 사후 학습 과정을 조사합니다. 먼저, 우리는 현재의 ATP 모델을 다양한 진술-증명 쌍과 인간의 추론 및 가설 정제를 모방하는 인지적 행동을 통합하기 위한 추가 데이터로 구성된 하이브리드 데이터셋으로 지속적으로 학습시킵니다. 다음으로, Lean 4 컴파일러가 반환하는 결과 보상을 활용한 강화 학습을 탐구합니다. 우리가 설계한 지속적 학습 및 강화 학습 과정을 통해 DeepSeek-Prover-v1.5와 Goedel-Prover를 포함한 기존의 형식적 증명기를 개선하여 전체 증명 생성 분야에서 최첨단 성능을 달성했습니다. 예를 들어, MiniF2F에서 59.8%의 통과율(pass@32)을 달성했습니다. 이는 진행 중인 프로젝트이며, 우리는 점진적으로 연구 결과를 업데이트하고 데이터 및 학습 세부 사항을 공개할 예정입니다.

English

Recent advances in automated theorem proving (ATP) through LLMs have highlighted the potential of formal reasoning with Lean 4 codes. However, ATP has not yet be revolutionized by the recent posttraining scaling as demonstrated by Open AI O1/O3 and Deepseek R1. In this work, we investigate the entire posttraining of ATP, aiming to align it with breakthroughs in reasoning models in natural languages.To begin, we continual train current ATP models with a hybrid dataset, which consists of numerous statement-proof pairs, and additional data aimed at incorporating cognitive behaviors that emulate human reasoning and hypothesis refinement. Next, we explore reinforcement learning with the use of outcome reward returned by Lean 4 compiler. Through our designed continual training and reinforcement learning processes, we have successfully improved existing formal provers, including both DeepSeek-Prover-v1.5 and Goedel-Prover, achieving state-of-the-art performance in the field of whole-proof generation. For example, we achieve a 59.8% pass rate (pass@32) on MiniF2F. This is an on-going project and we will progressively update our findings, release our data and training details.

Leanabell-Prover: 형식적 추론에서의 사후 학습 스케일링

Leanabell-Prover: Posttraining Scaling in Formal Reasoning

초록

Support