DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning
November 27, 2025
Authors: Zhihong Shao, Yuxiang Luo, Chengda Lu, Z. Z. Ren, Jiewen Hu, Tian Ye, Zhibin Gou, Shirong Ma, Xiaokang Zhang
cs.AI
Abstract
Large language models have made significant progress in mathematical reasoning, which serves as an important testbed for AI and could have a profound impact on scientific research if advanced further. By scaling reasoning with reinforcement learning that rewards correct final answers, LLMs have gone from poor performance to saturating quantitative reasoning competitions such as AIME and HMMT within a year. However, this approach faces a fundamental limitation: pursuing higher final-answer accuracy does not address a key issue, namely that correct answers do not guarantee correct reasoning. Moreover, many mathematical tasks such as theorem proving require rigorous step-by-step derivations rather than numerical answers, making final-answer rewards inapplicable. To push the limits of deep reasoning, we believe it is necessary to verify the comprehensiveness and rigor of mathematical reasoning. Self-verification is particularly important for scaling test-time compute, especially on open problems without known solutions.

Towards self-verifiable mathematical reasoning, we investigate how to train an accurate and faithful LLM-based verifier for theorem proving. We then train a proof generator with this verifier as the reward model, incentivizing the generator to identify and resolve as many issues as possible in its own proofs before finalizing them. To maintain the generation-verification gap as the generator becomes stronger, we propose scaling verification compute to automatically label new hard-to-verify proofs, creating training data that further improves the verifier. Our resulting model, DeepSeekMath-V2, demonstrates strong theorem-proving capabilities, achieving gold-level scores on IMO 2025 and CMO 2024 and a near-perfect 118/120 on Putnam 2024 with scaled test-time compute.
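The generate-verify-refine loop described in the abstract can be sketched in miniature. The following is a toy illustration only, with hypothetical names throughout: in the actual system both the generator and the verifier are LLMs, whereas here they are trivial stand-ins, so only the control flow (draft a proof, let the verifier flag issues, resolve issues before finalizing) reflects the described method.

```python
import random

def toy_verifier(proof: str) -> float:
    """Return a score in [0, 1], penalizing each remaining flagged flaw.

    Stand-in for the LLM-based verifier used as the reward model.
    """
    flaws = proof.count("[flaw]")
    return max(0.0, 1.0 - 0.25 * flaws)

def toy_generator(problem: str, rng: random.Random) -> str:
    """Draft a proof containing a random number of flaws (stand-in for the LLM generator)."""
    n_flaws = rng.randint(0, 3)
    return f"proof of {problem} " + "[flaw] " * n_flaws

def self_verify_and_refine(problem: str, rng: random.Random, rounds: int = 4) -> str:
    """Generate a proof, then keep refining while the verifier still finds issues."""
    proof = toy_generator(problem, rng)
    for _ in range(rounds):
        if toy_verifier(proof) == 1.0:
            break  # verifier finds no remaining issues; finalize
        proof = proof.replace("[flaw] ", "", 1)  # resolve one flagged issue
    return proof

rng = random.Random(0)
final = self_verify_and_refine("sample theorem", rng)
print(toy_verifier(final))  # prints 1.0: all flagged issues were resolved
```

In the real training loop, the verifier's score is the reinforcement-learning reward, so the generator is optimized to internalize this refinement behavior rather than following a hand-written repair rule as above.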