DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search
August 15, 2024
Authors: Huajian Xin, Z. Z. Ren, Junxiao Song, Zhihong Shao, Wanjia Zhao, Haocheng Wang, Bo Liu, Liyue Zhang, Xuan Lu, Qiushi Du, Wenjun Gao, Qihao Zhu, Dejian Yang, Zhibin Gou, Z. F. Wu, Fuli Luo, Chong Ruan
cs.AI
Abstract
We introduce DeepSeek-Prover-V1.5, an open-source language model designed for
theorem proving in Lean 4, which enhances DeepSeek-Prover-V1 by optimizing both
training and inference processes. Pre-trained on DeepSeekMath-Base with
specialization in formal mathematical languages, the model undergoes supervised
fine-tuning using an enhanced formal theorem proving dataset derived from
DeepSeek-Prover-V1. Further refinement is achieved through reinforcement
learning from proof assistant feedback (RLPAF). Beyond the single-pass
whole-proof generation approach of DeepSeek-Prover-V1, we propose RMaxTS, a
variant of Monte-Carlo tree search that employs an intrinsic-reward-driven
exploration strategy to generate diverse proof paths. DeepSeek-Prover-V1.5
demonstrates significant improvements over DeepSeek-Prover-V1, achieving new
state-of-the-art results on the test set of the high school level miniF2F
benchmark (63.5%) and the undergraduate level ProofNet benchmark (25.3%).
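For readers unfamiliar with the setting, the toy Lean 4 theorem below illustrates the kind of formal statement and complete proof a whole-proof generation model is asked to produce in a single pass. The example is purely illustrative and is not drawn from the miniF2F or ProofNet benchmarks.

```lean
-- A toy Lean 4 goal of the kind a whole-proof model must discharge
-- (illustrative; not from miniF2F or ProofNet).
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```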
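The abstract does not detail how RMaxTS works; as a rough intuition, the following is a minimal, hypothetical Python sketch of a Monte-Carlo tree search with an RMax-style intrinsic reward, in which the search earns a bonus whenever an expansion adds a previously unseen node, pushing it toward diverse proof paths rather than repeated failures. The `expand` and `proof_found` callbacks, the reward scheme, and the exploration constant are assumptions for illustration, not the authors' implementation.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    state: str                           # partial proof text (illustrative)
    children: dict = field(default_factory=dict)
    visits: int = 0
    total_reward: float = 0.0            # sum of backed-up rewards

def ucb_select(node: "Node", c: float = 1.4) -> "Node":
    # Standard UCB1: mean reward plus an exploration bonus.
    return max(
        node.children.values(),
        key=lambda ch: ch.total_reward / max(ch.visits, 1)
        + c * math.sqrt(math.log(node.visits + 1) / max(ch.visits, 1)),
    )

def rmax_ts(root: Node, expand, proof_found, iterations: int = 100) -> Node:
    """expand(state) -> candidate next states (e.g., sampled tactic steps);
    proof_found(state) -> bool (e.g., the Lean verifier accepts the proof)."""
    for _ in range(iterations):
        # Selection: walk down the tree via UCB until reaching a leaf.
        node, path = root, [root]
        while node.children:
            node = ucb_select(node)
            path.append(node)
        # Expansion: the intrinsic reward is 1 only if a genuinely new
        # node enters the tree, rewarding discovery of novel proof states.
        intrinsic = 0.0
        for nxt in expand(node.state):
            if nxt not in node.children:
                node.children[nxt] = Node(state=nxt)
                intrinsic = 1.0
        # Extrinsic reward: 1 for a verified proof, 0 otherwise (sparse).
        reward = (1.0 if proof_found(node.state) else 0.0) + intrinsic
        # Backpropagation along the selected path.
        for n in path:
            n.visits += 1
            n.total_reward += reward
    return root
```

The intrinsic bonus matters because the extrinsic signal (a verified proof) is extremely sparse; rewarding tree growth keeps the search exploring new proof states instead of collapsing onto a few already-visited branches.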