The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
June 1, 2023
Authors: Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, Julien Launay
cs.AI
Abstract
Large language models are commonly trained on a mixture of filtered web data
and curated high-quality corpora, such as social media conversations, books, or
technical papers. This curation process is believed to be necessary to produce
performant models with broad zero-shot generalization abilities. However, as
larger models requiring pretraining on trillions of tokens are considered, it
is unclear how scalable curation is and whether we will run out of unique
high-quality data soon. At variance with previous beliefs, we show that
properly filtered and deduplicated web data alone can lead to powerful models;
even significantly outperforming state-of-the-art models trained on
The Pile. Despite extensive filtering, the high-quality data we extract from
the web is still plentiful, and we are able to obtain five trillion tokens from
CommonCrawl. We publicly release an extract of 600 billion tokens from our
RefinedWeb dataset, along with 1.3B and 7.5B parameter language models trained on it.
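
The abstract's central claim rests on two operations applied to CommonCrawl text: heuristic quality filtering and large-scale deduplication. The sketch below is a minimal, self-contained illustration of what such a pipeline can look like; it is not the pipeline used to build RefinedWeb, and the function names (passes_quality_filters, deduplicate_exact), thresholds, and heuristics are assumptions made purely for illustration.

import hashlib
import re

# Illustrative thresholds -- assumed values for this sketch, not the
# settings used to build RefinedWeb.
MIN_WORDS = 50
MAX_WORDS = 100_000
MAX_SYMBOL_RATIO = 0.1      # max fraction of "#" / "..." symbols per word
MIN_ALPHA_WORD_RATIO = 0.8  # min fraction of words containing a letter


def passes_quality_filters(text: str) -> bool:
    """Cheap document-level heuristics of the kind used to filter web text."""
    words = text.split()
    if not (MIN_WORDS <= len(words) <= MAX_WORDS):
        return False
    symbols = text.count("#") + text.count("...")
    if symbols / max(len(words), 1) > MAX_SYMBOL_RATIO:
        return False
    alpha_words = sum(1 for w in words if re.search(r"[a-zA-Z]", w))
    if alpha_words / len(words) < MIN_ALPHA_WORD_RATIO:
        return False
    return True


def deduplicate_exact(docs):
    """Drop exact duplicates by hashing whitespace-normalized, lowercased text
    (the fuzzy/MinHash matching needed at scale is omitted for brevity)."""
    seen, kept = set(), []
    for doc in docs:
        key = hashlib.sha256(" ".join(doc.split()).lower().encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept


if __name__ == "__main__":
    raw_docs = [
        "### spam ### " * 20,                             # fails the symbol-ratio check
        "This page has plenty of ordinary prose. " * 20,  # passes the heuristics
        "This page has plenty of ordinary prose. " * 20,  # exact duplicate, dropped
    ]
    filtered = [d for d in raw_docs if passes_quality_filters(d)]
    print(f"kept {len(deduplicate_exact(filtered))} of {len(raw_docs)} documents")

At the trillion-token scale described in the abstract, such filtering would run as a distributed job rather than in-memory Python, and exact hashing would be complemented by fuzzy deduplication (e.g. MinHash) to catch near-duplicate web pages.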