Generative AI for Math: Part I -- MathPile: A Billion-Token-Scale Pretraining Corpus for Math
December 28, 2023
Authors: Zengzhi Wang, Rui Xia, Pengfei Liu
cs.AI
Abstract
High-quality, large-scale corpora are the cornerstone of building foundation
models. In this work, we introduce MathPile, a diverse and
high-quality math-centric corpus comprising about 9.5 billion tokens.
Throughout its creation, we adhered to the principle of "less is
more", firmly believing in the supremacy of data quality over quantity, even
in the pre-training phase. Our meticulous data collection and processing
efforts included a complex suite of preprocessing, prefiltering, language
identification, cleaning, filtering, and deduplication, ensuring the high
quality of our corpus. Furthermore, we performed data contamination detection
on downstream benchmark test sets to eliminate duplicates. We hope our
MathPile can help to enhance the mathematical reasoning abilities of
language models. We plan to open-source different versions of MathPile with
the scripts used for processing, to facilitate future developments in this
field.
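
The deduplication and contamination-detection steps named above can be illustrated with a minimal sketch. This is not the authors' pipeline: the abstract does not specify their algorithms, so the exact-match hashing, the word-level n-gram overlap test, the default n = 5, and all function names below are assumptions chosen for illustration only.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def dedup_exact(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each document, compared after normalization."""
    seen: set[str] = set()
    kept: list[str] = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of word n-grams of the normalized text."""
    words = normalize(text).split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def is_contaminated(doc: str, benchmark_items: list[str], n: int = 5) -> bool:
    """Flag a document that shares at least one word n-gram with a benchmark item.

    Assumption: n = 5 is an illustrative default, not a value from the paper.
    Documents shorter than n words have no n-grams and are never flagged.
    """
    doc_grams = word_ngrams(doc, n)
    return any(doc_grams & word_ngrams(item, n) for item in benchmark_items)


def clean_corpus(docs: list[str], benchmark_items: list[str], n: int = 5) -> list[str]:
    """Drop exact duplicates, then drop documents overlapping the benchmark."""
    return [d for d in dedup_exact(docs) if not is_contaminated(d, benchmark_items, n)]


if __name__ == "__main__":
    corpus = [
        "Prove that 2 + 2 = 4.",
        "prove  that 2 + 2 = 4.",  # duplicate up to case and spacing
        "What is the sum of the first ten positive integers?",
        "Let G be a finite group.",
    ]
    benchmark = ["What is the sum of the first ten positive integers? Answer: 55"]
    for doc in clean_corpus(corpus, benchmark):
        print(doc)
```

A production pipeline at this scale would more plausibly use near-duplicate detection (for example MinHash-based methods) rather than exact hashing; the sketch only shows the shape of the two checks.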