MegaMath: Pushing the Limits of Open Math Corpora

Abstract

Mathematical reasoning is a cornerstone of human intelligence and a key benchmark for the advanced capabilities of large language models (LLMs). However, the research community still lacks an open, large-scale, high-quality corpus tailored to the demands of math-centric LLM pre-training. We present MegaMath, an open dataset curated from diverse, math-focused sources through the following practices: (1) Revisiting web data: we re-extracted mathematical documents from Common Crawl using math-oriented HTML optimizations, fastText-based filtering, and deduplication to obtain higher-quality data from the Internet. (2) Recalling math-related code data: we identified high-quality math-related code in Stack-V2, a large code training corpus, further enhancing data diversity. (3) Exploring synthetic data: we synthesized QA-style text, math-related code, and interleaved text-code blocks from web data or code data. By integrating these strategies and validating their effectiveness through extensive ablations, MegaMath delivers 371B tokens, the largest quantity and the highest quality among existing open math pre-training datasets.
