When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale
September 8, 2023
作者: Max Marion, Ahmet Üstün, Luiza Pozzobon, Alex Wang, Marzieh Fadaee, Sara Hooker
cs.AI
Abstract
Large volumes of text data have contributed significantly to the development
of large language models (LLMs) in recent years. This data is typically
acquired by scraping the internet, leading to pretraining datasets composed of
noisy web text. To date, efforts to prune these datasets down to a higher
quality subset have relied on hand-crafted heuristics encoded as rule-based
filters. In this work, we take a wider view and explore scalable estimates of
data quality that can be used to systematically measure the quality of
pretraining data. We perform a rigorous comparison at scale of the simple data
quality estimator of perplexity, as well as more sophisticated and
computationally intensive estimates of the Error L2-Norm and memorization.
These metrics are used to rank and prune pretraining corpora, and we
subsequently compare LLMs trained on these pruned datasets. Surprisingly, we
find that the simple technique of perplexity outperforms our more
computationally expensive scoring methods. We improve over our no-pruning
baseline while training on as little as 30% of the original training dataset.
Our work sets the foundation for unexplored strategies in automatically
curating high quality corpora and suggests the majority of pretraining data can
be removed while retaining performance.
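The abstract describes scoring each pretraining document with a quality estimator, here perplexity under a reference language model, ranking the corpus by that score, and training on only a retained subset (as little as 30%). The sketch below is a minimal illustration of such a pipeline, assuming a HuggingFace-style causal LM as the scorer; the reference model ("gpt2"), the keep fraction, and the decision to keep the lowest-perplexity documents are illustrative assumptions rather than the authors' exact setup, and which slice of the perplexity distribution to retain is itself one of the choices the paper investigates.

# Illustrative sketch of perplexity-based data pruning; not the paper's implementation.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"      # hypothetical reference model used only for scoring
KEEP_FRACTION = 0.3      # the abstract reports gains with as little as 30% of the data

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    """Per-token perplexity of a document under the reference model."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())  # loss is the mean cross-entropy per token

def prune_by_perplexity(docs, keep_fraction=KEEP_FRACTION):
    """Rank documents by perplexity and keep a fraction of the ranking.
    Keeping the lowest-perplexity slice is one possible criterion; the paper
    compares different regions of the score distribution."""
    ranked = sorted(docs, key=perplexity)              # lowest perplexity first
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]

if __name__ == "__main__":
    corpus = [
        "The capital of France is Paris.",
        "asdf qwer 1234 zzzz !!!! click here free",    # noisy web text
        "Large language models are pretrained on web-scale text corpora.",
    ]
    for doc in prune_by_perplexity(corpus):
        print(doc)

A full-scale version of this pipeline would score documents in batches on accelerators, and the paper additionally compares perplexity against the more computationally intensive Error L2-Norm and memorization estimates as ranking signals.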