DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search
August 15, 2024
Authors: Huajian Xin, Z. Z. Ren, Junxiao Song, Zhihong Shao, Wanjia Zhao, Haocheng Wang, Bo Liu, Liyue Zhang, Xuan Lu, Qiushi Du, Wenjun Gao, Qihao Zhu, Dejian Yang, Zhibin Gou, Z. F. Wu, Fuli Luo, Chong Ruan
cs.AI
Abstract
We introduce DeepSeek-Prover-V1.5, an open-source language model designed for
theorem proving in Lean 4, which enhances DeepSeek-Prover-V1 by optimizing both
training and inference processes. Pre-trained on DeepSeekMath-Base with
specialization in formal mathematical languages, the model undergoes supervised
fine-tuning using an enhanced formal theorem proving dataset derived from
DeepSeek-Prover-V1. Further refinement is achieved through reinforcement
learning from proof assistant feedback (RLPAF). Beyond the single-pass
whole-proof generation approach of DeepSeek-Prover-V1, we propose RMaxTS, a
variant of Monte-Carlo tree search that employs an intrinsic-reward-driven
exploration strategy to generate diverse proof paths. DeepSeek-Prover-V1.5
demonstrates significant improvements over DeepSeek-Prover-V1, achieving new
state-of-the-art results on the test set of the high school level miniF2F
benchmark (63.5%) and the undergraduate level ProofNet benchmark (25.3%).
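The abstract names RLPAF but does not detail the training signal. A setup consistent with the description is to reward a generated proof only when the proof assistant accepts it; the sketch below assumes that binary scheme, and `verify_with_lean` is a hypothetical stand-in for a call into Lean 4.

```python
def rlpaf_reward(proof_text: str, verify_with_lean) -> float:
    # Hypothetical binary reward: 1.0 if Lean 4 accepts the whole proof,
    # 0.0 otherwise. The abstract does not specify any finer reward shaping.
    return 1.0 if verify_with_lean(proof_text) else 0.0
```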
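RMaxTS is likewise described only at a high level: a Monte-Carlo tree search variant whose exploration is driven by an intrinsic reward. The following is a minimal sketch of one plausible reading, in which the search grants a novelty bonus whenever an expansion reaches a tactic state not seen before. `propose_tactics`, `apply_tactic`, and `is_proved` are hypothetical placeholders for the model's proposal step and Lean 4 verification, and the selection rule here is textbook UCB1 rather than whatever variant the paper actually uses.

```python
import math

# Minimal sketch of intrinsic-reward-driven Monte-Carlo tree search in the
# spirit of RMaxTS. Everything here is an illustrative assumption: the real
# system searches over Lean 4 tactic states with a language-model policy.

class Node:
    def __init__(self, state, parent=None):
        self.state = state        # a hashable proof/tactic state
        self.parent = parent
        self.children = {}        # tactic -> child Node
        self.visits = 0
        self.value = 0.0          # accumulated intrinsic reward

def ucb_score(child, parent_visits, c=1.4):
    # Textbook UCB1; the paper's exploration rule may differ.
    if child.visits == 0:
        return float("inf")
    exploit = child.value / child.visits
    explore = c * math.sqrt(math.log(parent_visits) / child.visits)
    return exploit + explore

def rmax_ts(root_state, propose_tactics, apply_tactic, is_proved, iters=100):
    """Search for a proof; returns a proved state or None.

    propose_tactics(state) -> iterable of candidate tactics (hypothetical
        stand-in for sampling from the prover model).
    apply_tactic(state, tactic) -> next state, or None if the proof
        assistant rejects the tactic.
    is_proved(state) -> True when no goals remain.
    """
    root = Node(root_state)
    seen = {root_state}           # novelty memory for the intrinsic reward
    for _ in range(iters):
        # 1. Selection: walk down the tree by UCB until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children.values(),
                       key=lambda ch: ucb_score(ch, node.visits))
        # 2. Expansion: try candidate tactics at the leaf.
        reward = 0.0
        for tactic in propose_tactics(node.state):
            nxt = apply_tactic(node.state, tactic)
            if nxt is None:
                continue          # rejected by the proof assistant
            node.children[tactic] = Node(nxt, parent=node)
            if nxt not in seen:   # intrinsic reward: a novel tactic state
                seen.add(nxt)
                reward = 1.0
            if is_proved(nxt):
                return nxt        # proof complete
        # 3. Backpropagation of the intrinsic reward along the path.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    return None
```

The point of a novelty bonus in this style of search is that effort is steered toward unexplored regions of the proof tree, which matches the abstract's stated goal of generating diverse proof paths rather than repeatedly refining a single one.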