

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

June 26, 2025
作者: Guilherme Penedo, Hynek Kydlíček, Vinko Sabolčec, Bettina Messmer, Negar Foroutan, Amir Hossein Kargaran, Colin Raffel, Martin Jaggi, Leandro Von Werra, Thomas Wolf
cs.AI

Abstract

Pre-training state-of-the-art large language models (LLMs) requires vast amounts of clean and diverse text data. While the open development of large high-quality English pre-training datasets has seen substantial recent progress, training performant multilingual LLMs remains a challenge, in large part due to the inherent difficulty of tailoring filtering and deduplication pipelines to a large number of languages. In this work, we introduce a new pre-training dataset curation pipeline based on FineWeb that can be automatically adapted to support any language. We extensively ablate our pipeline design choices on a set of nine diverse languages, guided by a set of meaningful and informative evaluation tasks that were chosen through a novel selection process based on measurable criteria. Ultimately, we show that our pipeline can be used to create non-English corpora that produce more performant models than prior datasets. We additionally introduce a straightforward and principled approach to rebalance datasets that takes into consideration both duplication count and quality, providing an additional performance uplift. Finally, we scale our pipeline to over 1000 languages using almost 100 Common Crawl snapshots to produce FineWeb2, a new 20 terabyte (5 billion document) multilingual dataset which we release along with our pipeline, training, and evaluation codebases.
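The abstract mentions a rebalancing method that accounts for both a document's duplication count and its quality. As a rough illustration of that idea only (the paper's actual formula, thresholds, and field names are not given here, so everything below — the `Doc` record, `repetition_weight`, the log-scaled cap, and the 0.5 quality cutoff — is a hypothetical sketch, not the authors' implementation):

```python
import math
from dataclasses import dataclass

@dataclass
class Doc:
    text: str
    dup_count: int   # times the document appeared across the crawl (>= 1)
    quality: float   # filter-derived quality score in [0, 1]

def repetition_weight(doc: Doc, max_reps: int = 5) -> int:
    """Illustrative rebalancing rule: upsample a kept document in
    proportion to how widely it was duplicated, but only if it also
    scores well on quality, and never beyond a fixed cap.
    (The cap, log scaling, and 0.5 threshold are assumptions.)"""
    if doc.quality < 0.5:
        return 1  # low-quality documents are never upsampled
    # log-scaled duplication count, capped at max_reps
    return min(max_reps, 1 + int(math.log2(doc.dup_count)))

def rebalance(corpus: list[Doc]) -> list[Doc]:
    """Materialize the rebalanced dataset by repeating each kept
    document according to its weight."""
    out: list[Doc] = []
    for doc in corpus:
        out.extend([doc] * repetition_weight(doc))
    return out
```

Under this sketch, a unique document (dup_count = 1) keeps weight 1, while a widely duplicated, high-quality document is repeated a few times, using duplication frequency as a weak popularity signal rather than discarding it outright.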