
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

February 5, 2024
作者: Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, Daya Guo
cs.AI

Abstract

Mathematical reasoning poses a significant challenge for language models due to its complex and structured nature. In this paper, we introduce DeepSeekMath 7B, which continues pre-training DeepSeek-Coder-Base-v1.5 7B with 120B math-related tokens sourced from Common Crawl, together with natural language and code data. DeepSeekMath 7B has achieved an impressive score of 51.7% on the competition-level MATH benchmark without relying on external toolkits and voting techniques, approaching the performance level of Gemini-Ultra and GPT-4. Self-consistency over 64 samples from DeepSeekMath 7B achieves 60.9% on MATH. The mathematical reasoning capability of DeepSeekMath is attributed to two key factors: First, we harness the significant potential of publicly available web data through a meticulously engineered data selection pipeline. Second, we introduce Group Relative Policy Optimization (GRPO), a variant of Proximal Policy Optimization (PPO) that enhances mathematical reasoning abilities while concurrently optimizing the memory usage of PPO.
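The "self-consistency over 64 samples" figure refers to majority voting: sample many reasoning chains for the same problem, extract each chain's final answer, and return the most common one. A minimal sketch, assuming the final answers have already been extracted from the sampled chains (the helper name and sample data are hypothetical, not from the paper):

```python
from collections import Counter

def self_consistency_vote(answers):
    """Return the majority final answer among sampled reasoning chains.

    Self-consistency decoding: sample N chains-of-thought, keep only
    each chain's final answer, and pick the most frequent one.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    answer, _count = Counter(answers).most_common(1)[0]
    return answer

# 64 hypothetical final answers sampled for one MATH problem
samples = ["42"] * 40 + ["41"] * 15 + ["7"] * 9
print(self_consistency_vote(samples))  # majority answer: "42"
```

Note the 51.7% headline number is pass@1 greedy accuracy without this voting step; voting over 64 samples is what lifts the score to 60.9%.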
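GRPO's memory saving comes from dropping PPO's learned value (critic) model: for each question it samples a group of outputs and uses the group's own reward statistics as the baseline. A minimal sketch of that group-relative advantage estimate, under the assumption of one scalar reward per sampled output (the function name is illustrative, not from the paper's code):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages, as in GRPO.

    Given rewards r_1..r_G for G outputs sampled for the same question,
    each output's advantage is its reward normalized by the group mean
    and standard deviation, so no separate value model is needed.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard: all-equal rewards
    return [(r - mean) / std for r in rewards]

# Four sampled answers to one question, scored 1.0 (correct) / 0.0 (wrong)
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # → [1.0, -1.0, 1.0, -1.0]
```

In PPO the baseline would come from a critic network roughly the size of the policy; replacing it with these per-group statistics is what reduces training memory.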