The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
June 1, 2023
Authors: Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, Julien Launay
cs.AI
Abstract
Large language models are commonly trained on a mixture of filtered web data and curated high-quality corpora, such as social media conversations, books, or technical papers. This curation process is believed to be necessary to produce performant models with broad zero-shot generalization abilities. However, as larger models requiring pretraining on trillions of tokens are considered, it is unclear how scalable curation is, and whether we will run out of unique high-quality data soon. Contrary to previous beliefs, we show that properly filtered and deduplicated web data alone can lead to powerful models, even significantly outperforming state-of-the-art models trained on The Pile. Despite extensive filtering, the high-quality data we extract from the web remains plentiful, and we are able to obtain five trillion tokens from CommonCrawl. We publicly release a 600-billion-token extract of our RefinedWeb dataset, along with 1.3B- and 7.5B-parameter language models trained on it.
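As a minimal illustration of the "filter and deduplicate" idea summarized in the abstract (not the authors' actual RefinedWeb pipeline, which combines heuristic quality filters with fuzzy MinHash and exact deduplication at scale), the Python sketch below applies a toy word-count filter and exact hash-based deduplication to a small stream of documents. All function names and thresholds here are illustrative assumptions.

    # Toy sketch of "filter + deduplicate" over web documents.
    # NOT the RefinedWeb pipeline; it only illustrates the idea from the abstract:
    # keep documents that pass quality heuristics, then drop duplicates.
    import hashlib
    from typing import Iterable, Iterator

    def quality_filter(text: str, min_words: int = 50, max_words: int = 100_000) -> bool:
        """Toy quality heuristic: keep documents with a plausible word count."""
        n_words = len(text.split())
        return min_words <= n_words <= max_words

    def deduplicate(docs: Iterable[str]) -> Iterator[str]:
        """Exact deduplication via content hashing; the paper additionally
        describes fuzzy (MinHash) and exact-substring deduplication."""
        seen: set[str] = set()
        for text in docs:
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                yield text

    if __name__ == "__main__":
        raw_docs = [
            "a long enough web page " * 60,
            "a long enough web page " * 60,   # exact duplicate, dropped
            "too short",                      # fails the length filter
        ]
        kept = list(deduplicate(d for d in raw_docs if quality_filter(d)))
        print(f"kept {len(kept)} of {len(raw_docs)} documents")

The released 600-billion-token extract can presumably be streamed with the Hugging Face datasets library, e.g. load_dataset("tiiuae/falcon-refinedweb", streaming=True), though the exact Hub identifier should be verified against the official release.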