Generative AI for Math: Part I -- MathPile: A Billion-Token-Scale Pretraining Corpus for Math

Abstract

High-quality, large-scale corpora are the cornerstone of building foundation models. In this work, we introduce MathPile, a diverse and high-quality math-centric corpus comprising about 9.5 billion tokens. Throughout its creation, we adhered to the principle of "less is more", firmly believing in the supremacy of data quality over quantity, even in the pre-training phase. Our meticulous data collection and processing efforts included a complex suite of preprocessing, prefiltering, language identification, cleaning, filtering, and deduplication, ensuring the high quality of our corpus. Furthermore, we performed data contamination detection on downstream benchmark test sets to eliminate duplicates. We hope MathPile can help enhance the mathematical reasoning abilities of language models. We plan to open-source different versions of MathPile, together with the scripts used for processing, to facilitate future developments in this field.
