
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

February 5, 2024
作者: Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, Daya Guo
cs.AI

Abstract

Mathematical reasoning poses a significant challenge for language models due to its complex and structured nature. In this paper, we introduce DeepSeekMath 7B, which continues pre-training DeepSeek-Coder-Base-v1.5 7B with 120B math-related tokens sourced from Common Crawl, together with natural language and code data. DeepSeekMath 7B has achieved an impressive score of 51.7% on the competition-level MATH benchmark without relying on external toolkits and voting techniques, approaching the performance level of Gemini-Ultra and GPT-4. Self-consistency over 64 samples from DeepSeekMath 7B achieves 60.9% on MATH. The mathematical reasoning capability of DeepSeekMath is attributed to two key factors: First, we harness the significant potential of publicly available web data through a meticulously engineered data selection pipeline. Second, we introduce Group Relative Policy Optimization (GRPO), a variant of Proximal Policy Optimization (PPO) that enhances mathematical reasoning abilities while concurrently optimizing the memory usage of PPO.
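The "self-consistency over 64 samples" figure refers to majority voting: sample many reasoning chains for the same problem, extract each chain's final answer, and return the most common one. A minimal sketch, assuming the final answers have already been extracted from the sampled chains (the helper name and sample data are hypothetical, not from the paper):

```python
from collections import Counter

def self_consistency_vote(answers):
    """Return the majority final answer among sampled reasoning chains.

    Self-consistency decoding: sample N chains-of-thought, keep only
    each chain's final answer, and pick the most frequent one.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    answer, _count = Counter(answers).most_common(1)[0]
    return answer

# 64 hypothetical final answers sampled for one MATH problem
samples = ["42"] * 40 + ["41"] * 15 + ["7"] * 9
print(self_consistency_vote(samples))  # majority answer: "42"
```

Note the 51.7% headline number is pass@1 greedy accuracy without this voting step; voting over 64 samples is what lifts the score to 60.9%.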
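GRPO's memory saving comes from dropping PPO's learned value (critic) model: for each question it samples a group of outputs and uses the group's own reward statistics as the baseline. A minimal sketch of that group-relative advantage estimate, under the assumption of one scalar reward per sampled output (the function name is illustrative, not from the paper's code):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages, as in GRPO.

    Given rewards r_1..r_G for G outputs sampled for the same question,
    each output's advantage is its reward normalized by the group mean
    and standard deviation, so no separate value model is needed.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard: all-equal rewards
    return [(r - mean) / std for r in rewards]

# Four sampled answers to one question, scored 1.0 (correct) / 0.0 (wrong)
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # → [1.0, -1.0, 1.0, -1.0]
```

In PPO the baseline would come from a critic network roughly the size of the policy; replacing it with these per-group statistics is what reduces training memory.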