The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

Abstract

The performance of a large language model (LLM) depends heavily on the quality and size of its pretraining dataset. However, the pretraining datasets for state-of-the-art open LLMs like Llama 3 and Mixtral are not publicly available, and very little is known about how they were created. In this work, we introduce FineWeb, a 15-trillion token dataset derived from 96 Common Crawl snapshots that produces better-performing LLMs than other open pretraining datasets. To advance the understanding of how best to curate high-quality pretraining datasets, we carefully document and ablate all of the design choices used in FineWeb, including in-depth investigations of deduplication and filtering strategies. In addition, we introduce FineWeb-Edu, a 1.3-trillion token collection of educational text filtered from FineWeb. LLMs pretrained on FineWeb-Edu exhibit dramatically better performance on knowledge- and reasoning-intensive benchmarks like MMLU and ARC. Along with our datasets, we publicly release our data curation codebase and all of the models trained during our ablation experiments.
