FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language
June 26, 2025
Authors: Guilherme Penedo, Hynek Kydlíček, Vinko Sabolčec, Bettina Messmer, Negar Foroutan, Amir Hossein Kargaran, Colin Raffel, Martin Jaggi, Leandro Von Werra, Thomas Wolf
cs.AI
Abstract
Pre-training state-of-the-art large language models (LLMs) requires vast
amounts of clean and diverse text data. While the open development of large
high-quality English pre-training datasets has seen substantial recent
progress, training performant multilingual LLMs remains a challenge, in large
part due to the inherent difficulty of tailoring filtering and deduplication
pipelines to a large number of languages. In this work, we introduce a new
pre-training dataset curation pipeline based on FineWeb that can be
automatically adapted to support any language. We extensively ablate our
pipeline design choices on a set of nine diverse languages, guided by a set of
meaningful and informative evaluation tasks that were chosen through a novel
selection process based on measurable criteria. Ultimately, we show that our
pipeline can be used to create non-English corpora that produce more performant
models than prior datasets. We additionally introduce a straightforward and
principled approach to rebalance datasets that takes into consideration both
duplication count and quality, providing an additional performance uplift.
Finally, we scale our pipeline to over 1000 languages using almost 100 Common
Crawl snapshots to produce FineWeb2, a new 20 terabyte (5 billion document)
multilingual dataset which we release along with our pipeline, training, and
evaluation codebases.
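
The abstract only names the ingredients of the rebalancing step (duplication count and quality), not its formula. The toy snippet below is a minimal sketch of the general idea: an upsampling weight driven jointly by how often a document was duplicated on the web and by a quality score. All names, thresholds, and the log-shaped weighting are illustrative assumptions, not the rule actually used to build FineWeb2.

```python
import math
from dataclasses import dataclass

@dataclass
class Document:
    text: str
    duplicate_count: int   # times the document appeared across crawl snapshots before dedup
    quality_score: float   # score from a language-adapted quality filter, assumed in [0, 1]

def upsampling_factor(doc: Document,
                      quality_threshold: float = 0.5,
                      max_repetitions: int = 5) -> int:
    """Hypothetical rule: wide duplication is read as a popularity signal and
    rewarded with repetitions in the final mix, but only when the document
    also clears a quality bar. Names and constants here are illustrative."""
    if doc.quality_score < quality_threshold:
        return 1  # keep low-quality documents at most once; never upsample them
    # Logarithmic growth keeps extremely common boilerplate from dominating.
    return min(max_repetitions, 1 + int(math.log2(max(doc.duplicate_count, 1))))

corpus = [
    Document("popular, high-quality page", duplicate_count=40, quality_score=0.9),
    Document("rare page", duplicate_count=1, quality_score=0.8),
    Document("viral but spammy page", duplicate_count=500, quality_score=0.2),
]
rebalanced = [doc for doc in corpus for _ in range(upsampling_factor(doc))]
print([d.text for d in rebalanced])
# -> popular page repeated five times, rare page once, spammy page once
```

The quality gate is the key design point suggested by the abstract: duplication alone would preferentially upsample boilerplate and spam, which are among the most duplicated documents on the web, so repetition counts are only trusted for documents that independently pass the quality filter.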