DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Abstract

Mathematical reasoning poses a significant challenge for language models because of its complex and structured nature. In this paper, we introduce DeepSeekMath 7B, which continues pre-training DeepSeek-Coder-Base-v1.5 7B on 120B math-related tokens sourced from Common Crawl, together with natural language and code data. DeepSeekMath 7B achieves 51.7% on the competition-level MATH benchmark without relying on external toolkits or voting techniques, approaching the performance level of Gemini-Ultra and GPT-4. With self-consistency over 64 samples, DeepSeekMath 7B reaches 60.9% on MATH. The mathematical reasoning capability of DeepSeekMath is attributed to two key factors. First, we harness the significant potential of publicly available web data through a meticulously engineered data selection pipeline. Second, we introduce Group Relative Policy Optimization (GRPO), a variant of Proximal Policy Optimization (PPO) that enhances mathematical reasoning abilities while concurrently optimizing the memory usage of PPO.
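The self-consistency figure mentioned in the abstract refers to sampling several completions for the same problem and keeping the most frequent final answer. The snippet below is only a minimal sketch of that voting step, assuming the final answers have already been extracted as strings; the function name `self_consistency` and the sample data are hypothetical and are not taken from the paper's code.

```python
from collections import Counter


def self_consistency(answers):
    """Return the most frequent final answer among sampled completions.

    `answers` is assumed to be a list of already-extracted final answers
    (one per sampled completion), normalized to comparable strings.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    # Count how often each distinct answer appears and keep the top one.
    return Counter(answers).most_common(1)[0][0]


# Illustrative data only: five hypothetical samples for one problem.
sampled = ["42", "41", "42", "7", "42"]
print(self_consistency(sampled))  # prints 42
```

In the setting described in the abstract, the list would hold 64 answers per problem rather than five, and accuracy would be computed on the voted answer.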
