DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Abstract

Mathematical reasoning poses a significant challenge for language models because of its complex and structured nature. In this paper, we introduce DeepSeekMath 7B, which continues pre-training DeepSeek-Coder-Base-v1.5 7B on 120B math-related tokens sourced from Common Crawl, together with natural language and code data. DeepSeekMath 7B scores 51.7% on the competition-level MATH benchmark without relying on external toolkits or voting techniques, approaching the performance of Gemini-Ultra and GPT-4. With self-consistency over 64 samples, DeepSeekMath 7B reaches 60.9% on MATH. The mathematical reasoning capability of DeepSeekMath is attributed to two key factors. First, we harness the significant potential of publicly available web data through a meticulously engineered data selection pipeline. Second, we introduce Group Relative Policy Optimization (GRPO), a variant of Proximal Policy Optimization (PPO) that enhances mathematical reasoning ability while concurrently optimizing the memory usage of PPO.
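The abstract names two techniques without giving their details: self-consistency over sampled solutions, and GRPO. The sketch below is only an illustration and is not the authors' implementation. It assumes self-consistency means a majority vote over the final answers of independently sampled solutions, compared as exact strings. It also assumes the "group relative" part of GRPO can be shown by normalizing each sampled response's reward against the mean and standard deviation of the rewards in its own group. The function names `majority_vote` and `group_relative_advantages`, the `eps` parameter, and the use of the population standard deviation are hypothetical choices made for this example. The full GRPO objective is not shown.

```python
from collections import Counter
from statistics import mean, pstdev


def majority_vote(answers):
    """Return the most frequent final answer among sampled solutions.

    Assumption: answers are compared as exact strings; a real evaluation
    would likely normalize mathematically equivalent answers first.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    # most_common orders by count; equal counts keep first-seen order.
    return Counter(answers).most_common(1)[0][0]


def group_relative_advantages(rewards, eps=1e-8):
    """Score each reward relative to the other rewards in the same group.

    Assumption: population standard deviation plus a small eps to avoid
    division by zero when every reward in the group is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Five hypothetical final answers sampled for one problem.
    sampled = ["42", "41", "42", "42", "7"]
    print(majority_vote(sampled))

    # Hypothetical 0/1 correctness rewards for four responses to one prompt.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Because the baseline in this sketch comes from the group of samples itself, no separately trained value model is needed to compute it, which is one way to read the abstract's remark about memory usage.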
