The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

Abstract

The performance of a large language model (LLM) depends heavily on the quality and size of its pretraining dataset. However, the pretraining datasets for state-of-the-art open LLMs like Llama 3 and Mixtral are not publicly available, and very little is known about how they were created. In this work, we introduce FineWeb, a 15-trillion-token dataset derived from 96 Common Crawl snapshots that produces better-performing LLMs than other open pretraining datasets. To advance the understanding of how best to curate high-quality pretraining datasets, we carefully document and ablate all of the design choices used in FineWeb, including in-depth investigations of deduplication and filtering strategies. In addition, we introduce FineWeb-Edu, a 1.3-trillion-token collection of educational text filtered from FineWeb. LLMs pretrained on FineWeb-Edu exhibit dramatically better performance on knowledge- and reasoning-intensive benchmarks like MMLU and ARC. Along with our datasets, we publicly release our data curation codebase and all of the models trained during our ablation experiments.
