DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data

May 23, 2024
Authors: Huajian Xin, Daya Guo, Zhihong Shao, Zhizhou Ren, Qihao Zhu, Bo Liu, Chong Ruan, Wenda Li, Xiaodan Liang
cs.AI

Abstract

Proof assistants like Lean have revolutionized mathematical proof verification, ensuring high accuracy and reliability. Although large language models (LLMs) show promise in mathematical reasoning, their advancement in formal theorem proving is hindered by a lack of training data. To address this issue, we introduce an approach to generate extensive Lean 4 proof data derived from high-school and undergraduate-level mathematical competition problems. This approach involves translating natural language problems into formal statements, filtering out low-quality statements, and generating proofs to create synthetic data. After fine-tuning the DeepSeekMath 7B model on this synthetic dataset, which comprises 8 million formal statements with proofs, our model achieved whole-proof generation accuracies of 46.3% with 64 samples and 52% cumulatively on the Lean 4 miniF2F test, surpassing the baseline GPT-4 at 23.0% with 64 samples and a tree search reinforcement learning method at 41.0%. Additionally, our model successfully proved 5 out of 148 problems in the Lean 4 Formalized International Mathematical Olympiad (FIMO) benchmark, while GPT-4 failed to prove any. These results demonstrate the potential of leveraging large-scale synthetic data to enhance theorem-proving capabilities in LLMs. Both the synthetic dataset and the model will be made available to facilitate further research in this promising field.
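To make concrete what a Lean 4 "formal statement with proof" looks like, here is a minimal toy example in the style of a formalized high-school algebra identity. It is a hypothetical illustration written for this summary, not an item from the paper's synthetic dataset, and it assumes a Lean 4 environment with Mathlib available.

import Mathlib

-- Toy example (hypothetical, not from the paper's dataset): a competition-style
-- algebraic identity stated and proved in Lean 4.
theorem sq_add (a b : ℕ) : (a + b) ^ 2 = a ^ 2 + 2 * a * b + b ^ 2 := by
  -- `ring` closes goals that follow from the commutative (semi)ring axioms.
  ring

The pipeline described in the abstract produces millions of statement-proof pairs of roughly this shape, derived from actual competition problems rather than toy identities, which then serve as supervised fine-tuning data for the prover model.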