DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
February 5, 2024
Authors: Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, Daya Guo
cs.AI
Abstract
Mathematical reasoning poses a significant challenge for language models due
to its complex and structured nature. In this paper, we introduce DeepSeekMath
7B, which continues pre-training DeepSeek-Coder-Base-v1.5 7B with 120B
math-related tokens sourced from Common Crawl, together with natural language
and code data. DeepSeekMath 7B has achieved an impressive score of 51.7% on the
competition-level MATH benchmark without relying on external toolkits or
voting techniques, approaching the performance level of Gemini-Ultra and GPT-4.
Self-consistency over 64 samples from DeepSeekMath 7B achieves 60.9% on MATH.
The mathematical reasoning capability of DeepSeekMath is attributed to two key
factors: First, we harness the significant potential of publicly available web
data through a meticulously engineered data selection pipeline. Second, we
introduce Group Relative Policy Optimization (GRPO), a variant of Proximal
Policy Optimization (PPO) that enhances mathematical reasoning abilities while
concurrently optimizing the memory usage of PPO.
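
The 60.9% figure above comes from self-consistency: sampling many solutions per problem and taking the majority final answer. A minimal sketch of that decoding strategy, assuming a hypothetical `generate` sampler and `extract_answer` parser (neither is part of the paper's released artifacts):

```python
from collections import Counter

def self_consistency(generate, extract_answer, question, n_samples=64):
    """Majority voting over sampled chain-of-thought solutions.

    `generate` is assumed to draw one sampled completion per call
    (e.g. temperature > 0); `extract_answer` parses the final answer
    from a completion. Both are hypothetical placeholders.
    """
    answers = []
    for _ in range(n_samples):
        completion = generate(question)      # one sampled solution
        answer = extract_answer(completion)  # e.g. the boxed final answer
        if answer is not None:
            answers.append(answer)
    # The most frequent final answer across samples is the prediction.
    return Counter(answers).most_common(1)[0][0] if answers else None
```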
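
The abstract only names GRPO; the key idea in the paper is to drop PPO's learned value network and instead estimate the advantage baseline from the scores of a group of outputs sampled for the same question, which is where the memory savings come from. A minimal sketch of that group-relative advantage, assuming scalar per-output rewards and mean/std normalization over the group; all names are illustrative:

```python
import numpy as np

def group_relative_advantages(group_rewards, eps=1e-8):
    """Compute per-output advantages relative to the sampled group.

    `group_rewards` holds scalar rewards for G outputs sampled from the
    same question. Normalizing against the group mean/std replaces the
    PPO value-network baseline, which is the source of the memory
    savings mentioned in the abstract.
    """
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: rewards for a group of 4 sampled solutions to one question.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

These advantages would then enter a PPO-style clipped surrogate objective; the full GRPO objective in the paper additionally regularizes against a reference model via a KL term.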