DeepSeek-Prover-V1.5: 증명 보조 프로그램 피드백을 활용한 강화 학습과 몬테카를로 트리 탐색

초록

DeepSeek-Prover-V1.5를 소개합니다. 이는 Lean 4에서 정리 증명을 위해 설계된 오픈 소스 언어 모델로, 교육 및 추론 과정을 최적화하여 DeepSeek-Prover-V1을 개선합니다. DeepSeekMath-Base에서 사전 훈련된 이 모델은 공식 수학 언어에 특화되어 있으며, DeepSeek-Prover-V1에서 유도된 향상된 형식적 정리 증명 데이터셋을 사용한 지도 미세 조정을 거칩니다. RLPAF(증명 보조 피드백으로부터의 강화 학습)을 통해 추가적인 세밀화가 이루어집니다. DeepSeek-Prover-V1의 단일 통과 전체 증명 생성 방식을 넘어, 우리는 다양한 증명 경로를 생성하기 위해 내재 보상 주도 탐색 전략을 채택하는 Monte-Carlo 트리 탐색의 변형인 RMaxTS를 제안합니다. DeepSeek-Prover-V1.5는 DeepSeek-Prover-V1에 비해 상당한 향상을 보여주며, 고등학교 수준 miniF2F 벤치마크 테스트 세트(63.5%)와 대학 수준 ProofNet 벤치마크(25.3%)에서 새로운 최첨단 결과를 달성합니다.

English

We introduce DeepSeek-Prover-V1.5, an open-source language model designed for theorem proving in Lean 4, which enhances DeepSeek-Prover-V1 by optimizing both training and inference processes. Pre-trained on DeepSeekMath-Base with specialization in formal mathematical languages, the model undergoes supervised fine-tuning using an enhanced formal theorem proving dataset derived from DeepSeek-Prover-V1. Further refinement is achieved through reinforcement learning from proof assistant feedback (RLPAF). Beyond the single-pass whole-proof generation approach of DeepSeek-Prover-V1, we propose RMaxTS, a variant of Monte-Carlo tree search that employs an intrinsic-reward-driven exploration strategy to generate diverse proof paths. DeepSeek-Prover-V1.5 demonstrates significant improvements over DeepSeek-Prover-V1, achieving new state-of-the-art results on the test set of the high school level miniF2F benchmark (63.5%) and the undergraduate level ProofNet benchmark (25.3%).

DeepSeek-Prover-V1.5: 증명 보조 프로그램 피드백을 활용한 강화 학습과 몬테카를로 트리 탐색

DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search

초록

Support