DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning
November 27, 2025
Authors: Zhihong Shao, Yuxiang Luo, Chengda Lu, Z. Z. Ren, Jiewen Hu, Tian Ye, Zhibin Gou, Shirong Ma, Xiaokang Zhang
cs.AI
Abstract
Large language models have made significant progress in mathematical reasoning, which serves as an important testbed for AI and could impact scientific research if further advanced. By scaling reasoning with reinforcement learning that rewards correct final answers, LLMs have improved within a year from poor performance to saturating quantitative reasoning competitions like AIME and HMMT. However, this approach faces fundamental limitations. Pursuing higher final-answer accuracy does not address a key issue: correct answers do not guarantee correct reasoning. Moreover, many mathematical tasks such as theorem proving require rigorous step-by-step derivation rather than numerical answers, making final-answer rewards inapplicable. To push the limits of deep reasoning, we believe it is necessary to verify the comprehensiveness and rigor of mathematical reasoning. Self-verification is particularly important for scaling test-time compute, especially for open problems without known solutions. Towards self-verifiable mathematical reasoning, we investigate how to train an accurate and faithful LLM-based verifier for theorem proving. We then train a proof generator using the verifier as the reward model, incentivizing the generator to identify and resolve as many issues as possible in its own proofs before finalizing them. To maintain the generation-verification gap as the generator becomes stronger, we propose scaling verification compute to automatically label new hard-to-verify proofs, creating training data to further improve the verifier. Our resulting model, DeepSeekMath-V2, demonstrates strong theorem-proving capabilities, achieving gold-level scores on IMO 2025 and CMO 2024 and a near-perfect 118/120 on Putnam 2024 with scaled test-time compute.