The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
June 25, 2024
Authors: Guilherme Penedo, Hynek Kydlíček, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, Thomas Wolf
cs.AI
Abstract
The performance of a large language model (LLM) depends heavily on the
quality and size of its pretraining dataset. However, the pretraining datasets
for state-of-the-art open LLMs like Llama 3 and Mixtral are not publicly
available and very little is known about how they were created. In this work,
we introduce FineWeb, a 15-trillion token dataset derived from 96 Common Crawl
snapshots that produces better-performing LLMs than other open pretraining
datasets. To advance the understanding of how best to curate high-quality
pretraining datasets, we carefully document and ablate all of the design
choices used in FineWeb, including in-depth investigations of deduplication and
filtering strategies. In addition, we introduce FineWeb-Edu, a 1.3-trillion
token collection of educational text filtered from FineWeb. LLMs pretrained on
FineWeb-Edu exhibit dramatically better performance on knowledge- and
reasoning-intensive benchmarks like MMLU and ARC. Along with our datasets, we
publicly release our data curation codebase and all of the models trained
during our ablation experiments.