DeepSeek-Prover：透過大規模合成數據推進LLM中的定理證明

摘要

像 Lean 這樣的證明助手已經徹底改變了數學證明的驗證，確保高準確性和可靠性。儘管大型語言模型（LLMs）在數學推理方面顯示出潛力，但它們在正式定理證明方面的進展受到訓練數據的不足所阻礙。為了解決這個問題，我們提出了一種方法，通過高中和大學水平的數學競賽問題生成廣泛的 Lean 4 證明數據。這種方法涉及將自然語言問題轉化為正式陳述，過濾掉低質量的陳述，並生成證明以創建合成數據。在對這個包含了 800 萬個帶有證明的正式陳述的合成數據集上對 DeepSeekMath 7B 模型進行微調後，我們的模型在 Lean 4 miniF2F 測試中以 64 個樣本達到了 46.3% 的整個證明生成準確率，累計達到了 52%，超過了基準 GPT-4 的 23.0%（64 個樣本）和樹搜索強化學習方法的 41.0%。此外，我們的模型成功證明了 Lean 4 正式國際數學奧林匹亞（FIMO）基準中的 148 個問題中的 5 個，而 GPT-4 則無法證明任何一個。這些結果展示了利用大規模合成數據來增強LLMs中定理證明能力的潛力。合成數據集和模型將提供以促進這一前景廣闊的領域的進一步研究。

English

Proof assistants like Lean have revolutionized mathematical proof verification, ensuring high accuracy and reliability. Although large language models (LLMs) show promise in mathematical reasoning, their advancement in formal theorem proving is hindered by a lack of training data. To address this issue, we introduce an approach to generate extensive Lean 4 proof data derived from high-school and undergraduate-level mathematical competition problems. This approach involves translating natural language problems into formal statements, filtering out low-quality statements, and generating proofs to create synthetic data. After fine-tuning the DeepSeekMath 7B model on this synthetic dataset, which comprises 8 million formal statements with proofs, our model achieved whole-proof generation accuracies of 46.3% with 64 samples and 52% cumulatively on the Lean 4 miniF2F test, surpassing the baseline GPT-4 at 23.0% with 64 samples and a tree search reinforcement learning method at 41.0%. Additionally, our model successfully proved 5 out of 148 problems in the Lean 4 Formalized International Mathematical Olympiad (FIMO) benchmark, while GPT-4 failed to prove any. These results demonstrate the potential of leveraging large-scale synthetic data to enhance theorem-proving capabilities in LLMs. Both the synthetic dataset and the model will be made available to facilitate further research in this promising field.

DeepSeek-Prover：透過大規模合成數據推進LLM中的定理證明

DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data

摘要

Support