MegaMath: Pushing the Limits of Open Math Corpora

Abstract

Mathematical reasoning is a cornerstone of human intelligence and a key benchmark for advanced capabilities in large language models (LLMs). However, the research community still lacks an open, large-scale, high-quality corpus tailored to the demands of math-centric LLM pre-training. We present MegaMath, an open dataset curated from diverse, math-focused sources through the following practices: (1) Revisiting web data: We re-extracted mathematical documents from Common Crawl with math-oriented HTML optimizations, fastText-based filtering, and deduplication, in order to acquire higher-quality data from the Internet. (2) Recalling math-related code data: We identified high-quality math-related code from a large code training corpus, Stack-V2, further enhancing data diversity. (3) Exploring synthetic data: We synthesized QA-style text, math-related code, and interleaved text-code blocks from web data or code data. By integrating these strategies and validating their effectiveness through extensive ablations, MegaMath delivers 371B tokens, the largest quantity and highest quality among existing open math pre-training datasets.
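
As a rough illustration of the filter-then-deduplicate step mentioned for web data, the sketch below scores each document and drops low-scoring ones and repeats. It is not the MegaMath pipeline: `toy_math_score` is a hypothetical stand-in for a trained fastText classifier, exact hashing of normalized text stands in for whatever deduplication method the authors used, and the threshold of 0.3 is an arbitrary assumption.

```python
import hashlib
import re
from typing import Callable, Iterable, List

# Hypothetical stand-in for a trained fastText classifier: the fraction of
# whitespace-separated tokens containing an operator, a digit, or a
# LaTeX-style command.
MATH_TOKEN = re.compile(r"[=+\-*/^<>]|\\[a-zA-Z]+|\d")


def toy_math_score(text: str) -> float:
    """Return the share of tokens that look mathematical (0.0 for empty text)."""
    tokens = text.split()
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if MATH_TOKEN.search(tok))
    return hits / len(tokens)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def filter_and_dedup(
    docs: Iterable[str],
    score: Callable[[str], float] = toy_math_score,
    threshold: float = 0.3,  # assumed cutoff, chosen only for illustration
) -> List[str]:
    """Keep documents scoring at or above the threshold, first occurrence only."""
    seen = set()
    kept = []
    for doc in docs:
        if score(doc) < threshold:
            continue  # not math-like enough under the toy scorer
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalization
        seen.add(digest)
        kept.append(doc)
    return kept


docs = [
    "Solve x^2 + 3x = 10 for x.",
    "solve  x^2 + 3x = 10 for x.",
    "The weather was lovely all week long.",
]
print(filter_and_dedup(docs))
```

Passing a different `score` callable, for example one wrapping a real classifier, leaves the rest of the loop unchanged. Exact hashing only catches identical documents after normalization, so near-duplicate detection would need a different technique.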
