DeepSeekMath-V2: 自己検証可能な数学推論に向けて

要旨

大規模言語モデルは数学的推論において著しい進歩を遂げており、これはAIの重要な試験場として機能し、さらなる発展により科学研究に影響を与える可能性がある。正解を報酬とする強化学習による推論のスケーリングにより、LLMは低い性能状態から1年でAIMEやHMMTといった定量的推論競技で飽和する水準まで改善した。しかし、このアプローチには根本的な限界が存在する。最終回答精度の向上を追求しても、正解が正しい推論を保証しないという核心的課題は解決されない。さらに、定理証明のような多くの数学的タスクは数値的な回答ではなく厳密な段階的導出を必要とするため、最終回答報酬は適用不能である。深い推論の限界に挑むためには、数学的推論の包括性と厳密性を検証する必要性があると考えられる。特に既知の解がない未解決問題に対して、テスト時計算資源のスケーリングにおいて自己検証は極めて重要である。自己検証可能な数学的推論に向けて、我々は定理証明における正確で忠実なLLMベースの検証器の訓練方法を検討する。次に、この検証器を報酬モデルとして用いて証明生成器を訓練し、生成器が自身の証明を最終確定する前に可能な限り多くの問題点を特定し解決するよう動機付ける。生成器が強くなるにつれて生成-検証ギャップを維持するため、検証計算資源をスケールアップして新たな検証困難な証明を自動的にラベル付けし、検証器をさらに改善する訓練データを作成する提案を行う。結果として得られたモデルDeepSeekMath-V2は、強力な定理証明能力を示し、IMO 2025とCMO 2024で金賞レベル、Putnam 2024ではスケーリングされたテスト時計算資源により120点満点中118点というほぼ完璧な成績を達成した。

English

Large language models have made significant progress in mathematical reasoning, which serves as an important testbed for AI and could impact scientific research if further advanced. By scaling reasoning with reinforcement learning that rewards correct final answers, LLMs have improved from poor performance to saturating quantitative reasoning competitions like AIME and HMMT in one year. However, this approach faces fundamental limitations. Pursuing higher final answer accuracy doesn't address a key issue: correct answers don't guarantee correct reasoning. Moreover, many mathematical tasks like theorem proving require rigorous step-by-step derivation rather than numerical answers, making final answer rewards inapplicable. To push the limits of deep reasoning, we believe it is necessary to verify the comprehensiveness and rigor of mathematical reasoning. Self-verification is particularly important for scaling test-time compute, especially for open problems without known solutions. Towards self-verifiable mathematical reasoning, we investigate how to train an accurate and faithful LLM-based verifier for theorem proving. We then train a proof generator using the verifier as the reward model, and incentivize the generator to identify and resolve as many issues as possible in their own proofs before finalizing them. To maintain the generation-verification gap as the generator becomes stronger, we propose to scale verification compute to automatically label new hard-to-verify proofs, creating training data to further improve the verifier. Our resulting model, DeepSeekMath-V2, demonstrates strong theorem-proving capabilities, achieving gold-level scores on IMO 2025 and CMO 2024 and a near-perfect 118/120 on Putnam 2024 with scaled test-time compute.

DeepSeekMath-V2: 自己検証可能な数学推論に向けて

DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning

要旨

Support