How to Train Data-Efficient LLMs

Abstract

The training of large language models (LLMs) is expensive. In this paper, we study data-efficient approaches for pre-training LLMs, i.e., techniques that aim to optimize the Pareto frontier of model quality and training resource/data consumption. We seek to understand the tradeoffs associated with data selection routines based on (i) expensive-to-compute data-quality estimates, and (ii) maximization of coverage and diversity-based measures in the feature space. Our first technique, Ask-LLM, leverages the zero-shot reasoning capabilities of instruction-tuned LLMs to directly assess the quality of a training example. To target coverage, we propose Density sampling, which models the data distribution to select a diverse sample. In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density are the best methods in their respective categories. Coverage sampling can recover the performance of the full data. Models trained on Ask-LLM data consistently outperform full-data training, even when we reject 90% of the original dataset, and converge up to 70% faster.
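The two selection styles contrasted above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not specify the scoring prompt, the density estimator, or the sampling rule, so everything here is an assumption made for illustration. `quality_score` is a hypothetical stand-in for an LLM-based quality judge, `kernel_density` is a plain Gaussian kernel estimate over feature vectors, and `sample_inverse_density` favors low-density points as one simple way to spread a sample across the feature space.

```python
import math
import random
from typing import Callable, List, Sequence


def select_top_quality(examples: Sequence[str],
                       quality_score: Callable[[str], float],
                       keep_fraction: float) -> List[str]:
    """Quality-based selection: keep the highest-scoring fraction.

    quality_score is a hypothetical callable standing in for an
    LLM judge that rates a single training example.
    """
    k = max(1, int(len(examples) * keep_fraction))
    ranked = sorted(examples, key=quality_score, reverse=True)
    return ranked[:k]


def kernel_density(features: Sequence[Sequence[float]],
                   bandwidth: float = 1.0) -> List[float]:
    """Gaussian kernel density estimate at each point.

    Quadratic in the number of points; for illustration only.
    """
    densities = []
    for x in features:
        total = 0.0
        for y in features:
            sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
            total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
        densities.append(total / len(features))
    return densities


def sample_inverse_density(features: Sequence[Sequence[float]],
                           k: int,
                           bandwidth: float = 1.0,
                           seed: int = 0) -> List[int]:
    """Coverage-based selection (assumed rule): draw k distinct indices
    with weight proportional to 1 / estimated density."""
    rng = random.Random(seed)
    weights = [1.0 / d for d in kernel_density(features, bandwidth)]
    # Weighted sampling without replacement: give each item the key
    # u ** (1 / w) with u uniform in [0, 1), then keep the k largest keys.
    keys = [(rng.random() ** (1.0 / w), i) for i, w in enumerate(weights)]
    keys.sort(reverse=True)
    return sorted(i for _, i in keys[:k])
```

In this sketch, quality-based selection needs one scorer call per example, which is where the expense of an LLM judge would come from, while the coverage-based path only needs feature vectors such as text embeddings.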
