DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search
August 15, 2024
Authors: Huajian Xin, Z. Z. Ren, Junxiao Song, Zhihong Shao, Wanjia Zhao, Haocheng Wang, Bo Liu, Liyue Zhang, Xuan Lu, Qiushi Du, Wenjun Gao, Qihao Zhu, Dejian Yang, Zhibin Gou, Z. F. Wu, Fuli Luo, Chong Ruan
cs.AI
Abstract
We introduce DeepSeek-Prover-V1.5, an open-source language model designed for
theorem proving in Lean 4, which enhances DeepSeek-Prover-V1 by optimizing both
training and inference processes. Pre-trained on DeepSeekMath-Base with
specialization in formal mathematical languages, the model undergoes supervised
fine-tuning using an enhanced formal theorem proving dataset derived from
DeepSeek-Prover-V1. Further refinement is achieved through reinforcement
learning from proof assistant feedback (RLPAF). Beyond the single-pass
whole-proof generation approach of DeepSeek-Prover-V1, we propose RMaxTS, a
variant of Monte-Carlo tree search that employs an intrinsic-reward-driven
exploration strategy to generate diverse proof paths. DeepSeek-Prover-V1.5
demonstrates significant improvements over DeepSeek-Prover-V1, achieving new
state-of-the-art results on the test set of the high school level miniF2F
benchmark (63.5%) and the undergraduate level ProofNet benchmark (25.3%).
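The abstract names RLPAF but does not detail the training signal. A setup consistent with the description is to reward a generated proof only when the proof assistant accepts it; the sketch below assumes that binary scheme, and `verify_with_lean` is a hypothetical stand-in for a call into Lean 4.

```python
def rlpaf_reward(proof_text: str, verify_with_lean) -> float:
    # Hypothetical binary reward: 1.0 if Lean 4 accepts the whole proof,
    # 0.0 otherwise. The abstract does not specify any finer reward shaping.
    return 1.0 if verify_with_lean(proof_text) else 0.0
```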
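RMaxTS is likewise described only at a high level: a Monte-Carlo tree search variant whose exploration is driven by an intrinsic reward. The following is a minimal sketch of one plausible reading, in which the search grants a novelty bonus whenever an expansion reaches a tactic state not seen before. `propose_tactics`, `apply_tactic`, and `is_proved` are hypothetical placeholders for the model's proposal step and Lean 4 verification, and the selection rule here is textbook UCB1 rather than whatever variant the paper actually uses.

```python
import math

# Minimal sketch of intrinsic-reward-driven Monte-Carlo tree search in the
# spirit of RMaxTS. Everything here is an illustrative assumption: the real
# system searches over Lean 4 tactic states with a language-model policy.

class Node:
    def __init__(self, state, parent=None):
        self.state = state        # a hashable proof/tactic state
        self.parent = parent
        self.children = {}        # tactic -> child Node
        self.visits = 0
        self.value = 0.0          # accumulated intrinsic reward

def ucb_score(child, parent_visits, c=1.4):
    # Textbook UCB1; the paper's exploration rule may differ.
    if child.visits == 0:
        return float("inf")
    exploit = child.value / child.visits
    explore = c * math.sqrt(math.log(parent_visits) / child.visits)
    return exploit + explore

def rmax_ts(root_state, propose_tactics, apply_tactic, is_proved, iters=100):
    """Search for a proof; returns a proved state or None.

    propose_tactics(state) -> iterable of candidate tactics (hypothetical
        stand-in for sampling from the prover model).
    apply_tactic(state, tactic) -> next state, or None if the proof
        assistant rejects the tactic.
    is_proved(state) -> True when no goals remain.
    """
    root = Node(root_state)
    seen = {root_state}           # novelty memory for the intrinsic reward
    for _ in range(iters):
        # 1. Selection: walk down the tree by UCB until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children.values(),
                       key=lambda ch: ucb_score(ch, node.visits))
        # 2. Expansion: try candidate tactics at the leaf.
        reward = 0.0
        for tactic in propose_tactics(node.state):
            nxt = apply_tactic(node.state, tactic)
            if nxt is None:
                continue          # rejected by the proof assistant
            node.children[tactic] = Node(nxt, parent=node)
            if nxt not in seen:   # intrinsic reward: a novel tactic state
                seen.add(nxt)
                reward = 1.0
            if is_proved(nxt):
                return nxt        # proof complete
        # 3. Backpropagation of the intrinsic reward along the path.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    return None
```

The point of a novelty bonus in this style of search is that effort is steered toward unexplored regions of the proof tree, which matches the abstract's stated goal of generating diverse proof paths rather than repeatedly refining a single one.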