Generative AI for Math: Part I -- MathPile: A Billion-Token-Scale Pretraining Corpus for Math

December 28, 2023
Authors: Zengzhi Wang, Rui Xia, Pengfei Liu
cs.AI

Abstract

High-quality, large-scale corpora are the cornerstone of building foundation models. In this work, we introduce MathPile, a diverse and high-quality math-centric corpus comprising about 9.5 billion tokens. Throughout its creation, we adhered to the principle of "less is more", firmly believing in the supremacy of data quality over quantity, even in the pre-training phase. Our meticulous data collection and processing efforts included a complex suite of preprocessing, prefiltering, language identification, cleaning, filtering, and deduplication, ensuring the high quality of our corpus. Furthermore, we performed data contamination detection on downstream benchmark test sets to eliminate duplicates. We hope our MathPile can help to enhance the mathematical reasoning abilities of language models. We plan to open-source different versions of MathPile with the scripts used for processing, to facilitate future developments in this field.
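
The abstract does not say how contamination detection was carried out. As an illustration only, the sketch below shows one common approach: dropping any corpus document that shares a word n-gram with a benchmark test item. The function names, the whitespace tokenization, and the n-gram length are all assumptions made for this example and are not taken from the MathPile processing scripts.

```python
def word_ngrams(text, n):
    """Return the set of lowercase, whitespace-tokenized word n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def decontaminate(corpus, benchmark_items, n=10):
    """Keep only documents that share no word n-gram with any benchmark item.

    Hypothetical helper: the n-gram length and the exact-match criterion
    are illustrative choices, not the authors' settings.
    """
    # Collect every n-gram that appears in the benchmark test set once.
    benchmark_ngrams = set()
    for item in benchmark_items:
        benchmark_ngrams |= word_ngrams(item, n)

    # A document is treated as contaminated if any of its n-grams is in that set.
    return [doc for doc in corpus
            if not (word_ngrams(doc, n) & benchmark_ngrams)]


benchmark = ["What is the sum of the first ten positive integers"]
corpus = [
    "Exercise: what is the sum of the first ten positive integers ? Answer 55",
    "A group is a set with an associative operation",
]
clean = decontaminate(corpus, benchmark, n=5)
```

Here `clean` retains only the second document, because the first shares the 5-gram "what is the sum of" with the benchmark item. A shorter `n` removes more documents at the cost of more false positives; a longer `n` is more conservative.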