The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

Abstract

The performance of a large language model (LLM) depends heavily on the quality and size of its pretraining dataset. However, the pretraining datasets for state-of-the-art open LLMs like Llama 3 and Mixtral are not publicly available, and very little is known about how they were created. In this work, we introduce FineWeb, a 15-trillion token dataset derived from 96 Common Crawl snapshots that produces better-performing LLMs than other open pretraining datasets. To advance the understanding of how best to curate high-quality pretraining datasets, we carefully document and ablate all of the design choices used in FineWeb, including in-depth investigations of deduplication and filtering strategies. In addition, we introduce FineWeb-Edu, a 1.3-trillion token collection of educational text filtered from FineWeb. LLMs pretrained on FineWeb-Edu exhibit dramatically better performance on knowledge- and reasoning-intensive benchmarks like MMLU and ARC. Along with our datasets, we publicly release our data curation codebase and all of the models trained during our ablation experiments.
