Generative AI for Math: Part I -- MathPile: A Billion-Token-Scale Pretraining Corpus for Math

Abstract

High-quality, large-scale corpora are the cornerstone of building foundation models. In this work, we introduce MathPile, a diverse and high-quality math-centric corpus comprising about 9.5 billion tokens. Throughout its creation, we adhered to the principle of "less is more", firmly believing in the supremacy of data quality over quantity, even in the pre-training phase. Our meticulous data collection and processing efforts included a complex suite of preprocessing, prefiltering, language identification, cleaning, filtering, and deduplication, ensuring the high quality of our corpus. Furthermore, we performed data contamination detection on downstream benchmark test sets to eliminate duplicates. We hope MathPile can help enhance the mathematical reasoning abilities of language models. We plan to open-source different versions of MathPile, together with the scripts used for processing, to facilitate future developments in this field.
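
The abstract names deduplication and benchmark contamination detection but gives no details of either step. As a rough illustration only, the Python sketch below assumes exact-match deduplication over normalized text and a word-level n-gram overlap check against benchmark items. The function names, the normalization, and the window size `n = 8` are hypothetical choices made for this example, not the procedure used to build MathPile.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def dedup_exact(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each document, compared after normalization."""
    seen: set[str] = set()
    kept: list[str] = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of word-level n-grams of the normalized text."""
    words = normalize(text).split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def is_contaminated(doc: str, benchmark_items: list[str], n: int = 8) -> bool:
    """Flag a document that shares any word n-gram with a benchmark item.

    The threshold n = 8 is an assumption for this sketch, not a value
    taken from the source.
    """
    doc_grams = ngrams(doc, n)
    return any(doc_grams & ngrams(item, n) for item in benchmark_items)
```

In a pipeline of this shape, `dedup_exact` would run over the cleaned corpus and `is_contaminated` would then drop any surviving document that overlaps a benchmark test item. Near-duplicate detection, for example with MinHash, would need a different technique than the exact hashing shown here.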

