

When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale

September 8, 2023
Authors: Max Marion, Ahmet Üstün, Luiza Pozzobon, Alex Wang, Marzieh Fadaee, Sara Hooker
cs.AI

Abstract

Large volumes of text data have contributed significantly to the development of large language models (LLMs) in recent years. This data is typically acquired by scraping the internet, leading to pretraining datasets comprised of noisy web text. To date, efforts to prune these datasets down to a higher quality subset have relied on hand-crafted heuristics encoded as rule-based filters. In this work, we take a wider view and explore scalable estimates of data quality that can be used to systematically measure the quality of pretraining data. We perform a rigorous comparison at scale of the simple data quality estimator of perplexity, as well as more sophisticated and computationally intensive estimates of the Error L2-Norm and memorization. These metrics are used to rank and prune pretraining corpora, and we subsequently compare LLMs trained on these pruned datasets. Surprisingly, we find that the simple technique of perplexity outperforms our more computationally expensive scoring methods. We improve over our no-pruning baseline while training on as little as 30% of the original training dataset. Our work sets the foundation for unexplored strategies in automatically curating high quality corpora and suggests the majority of pretraining data can be removed while retaining performance.
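The abstract's core procedure is to score each pretraining document with a quality estimate (here, perplexity under a reference model), rank the corpus by that score, and keep only a fraction of it for training. Below is a minimal, illustrative sketch of that pipeline in Python. The reference model ("gpt2"), the truncation length, and the choice of keeping the lowest-perplexity fraction are assumptions for illustration only, not the paper's exact recipe; the paper also compares pruning by Error L2-Norm and memorization, which are not shown here.

```python
# Minimal sketch of perplexity-based data pruning (illustrative only).
# Assumptions: a small reference model ("gpt2") scores documents, and the
# lowest-perplexity fraction of the corpus is kept. The paper's actual
# reference model and selection rule may differ.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").to(device).eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    """Per-document perplexity under the reference model."""
    ids = tokenizer(text, return_tensors="pt",
                    truncation=True, max_length=1024).input_ids.to(device)
    # labels=input_ids yields the mean token-level cross-entropy loss
    loss = model(ids, labels=ids).loss
    return math.exp(loss.item())

def prune_corpus(docs: list[str], keep_fraction: float = 0.3) -> list[str]:
    """Rank documents by perplexity and keep a fraction of the corpus."""
    scored = sorted(docs, key=perplexity)      # lowest perplexity first
    k = max(1, int(len(scored) * keep_fraction))
    return scored[:k]                          # retained subset for pretraining

if __name__ == "__main__":
    corpus = [
        "The quick brown fox jumps over the lazy dog.",
        "asdf qwer zxcv 12345 !!!! lorem ipsum garble",
    ]
    print(prune_corpus(corpus, keep_fraction=0.5))
```

In practice the scoring pass would be batched and distributed over the full web-scale corpus, and the kept fraction (e.g., 30% as in the abstract) and the region of the score distribution to retain are hyperparameters to be validated by downstream model performance.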