When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale

Abstract

Large volumes of text data have contributed significantly to the development of large language models (LLMs) in recent years. This data is typically acquired by scraping the internet, which yields pretraining datasets composed of noisy web text. To date, efforts to prune these datasets down to a higher-quality subset have relied on hand-crafted heuristics encoded as rule-based filters. In this work, we take a wider view and explore scalable estimates of data quality that can be used to systematically measure the quality of pretraining data. We perform a rigorous comparison at scale of a simple data quality estimator, perplexity, against the more sophisticated and computationally intensive estimates of Error L2-Norm and memorization. We use these metrics to rank and prune pretraining corpora, and then compare LLMs trained on the pruned datasets. Surprisingly, we find that the simple technique of perplexity outperforms our more computationally expensive scoring methods. We improve over our no-pruning baseline while training on as little as 30% of the original training dataset. Our work sets the foundation for unexplored strategies in automatically curating high-quality corpora and suggests that the majority of pretraining data can be removed while retaining performance.
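To make the rank-and-prune step concrete, the following is a minimal sketch, not the authors' implementation. It assumes each document already has per-token natural-log probabilities from some reference language model. The function names `perplexity` and `prune_by_score` and the `region` parameter are hypothetical. The abstract does not state which part of the score distribution is retained, so the sketch exposes that choice as an option and does not fix it.

```python
import math
from typing import List, Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity as exp of the mean negative log-likelihood (natural log)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def prune_by_score(
    scores: Sequence[float], keep_fraction: float, region: str = "middle"
) -> List[int]:
    """Rank documents by score and return the indices of those kept.

    region is an illustrative assumption:
      "bottom" keeps the lowest-scoring documents,
      "middle" keeps a window centred on the median rank,
      "top"    keeps the highest-scoring documents.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    if region not in ("bottom", "middle", "top"):
        raise ValueError("region must be 'bottom', 'middle', or 'top'")

    n = len(scores)
    if n == 0:
        return []
    n_keep = max(1, round(n * keep_fraction))

    # Document indices ordered from lowest to highest score.
    ranked = sorted(range(n), key=lambda i: scores[i])

    if region == "bottom":
        start = 0
    elif region == "top":
        start = n - n_keep
    else:
        start = (n - n_keep) // 2

    # Return kept indices in original corpus order.
    return sorted(ranked[start:start + n_keep])
```

With `keep_fraction=0.3`, this keeps 30% of the documents, matching the smallest subset size mentioned in the abstract. The same function works for any scalar quality score, so Error L2-Norm or memorization scores could be passed in place of perplexity.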
