DataDecide: How to Predict Best Pretraining Data with Small Experiments

Abstract

Because large language models are expensive to pretrain on different datasets, using smaller-scale experiments to decide on data is crucial for reducing costs. Which benchmarks and methods of making decisions from observed performance at small scale most accurately predict the datasets that yield the best large models? To empower open exploration of this question, we release models, data, and evaluations in DataDecide, the most extensive open suite of models over differences in data and scale. We conduct controlled pretraining experiments across 25 corpora with differing sources, deduplication, and filtering up to 100B tokens, model sizes up to 1B parameters, and 3 random seeds. We find that the ranking of models at a single, small size (e.g., 150M parameters) is a strong baseline for predicting the best models at our larger target scale (1B), with ~80% of comparisons correct. No scaling law methods among 8 baselines exceed the compute-decision frontier of single-scale predictions, but DataDecide can measure improvement in future scaling laws. We also identify that using continuous likelihood metrics as proxies in small experiments makes benchmarks including MMLU, ARC, HellaSwag, MBPP, and HumanEval >80% predictable at the target 1B scale with just 0.01% of the compute.
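The "~80% of comparisons correct" figure can be read as a pairwise agreement rate: for each pair of corpora, check whether the corpus that wins at the small scale also wins at the target scale. The sketch below illustrates that reading only. The function name `decision_accuracy`, the corpus names, and all scores are hypothetical, and treating ties as incorrect is an assumption rather than something the abstract specifies.

```python
from itertools import combinations


def decision_accuracy(small_scores, large_scores):
    """Fraction of corpus pairs whose winner at small scale matches the
    winner at large scale. Ties at either scale count as incorrect
    (an assumption made for this sketch)."""
    # Only compare corpora that were evaluated at both scales.
    corpora = sorted(small_scores.keys() & large_scores.keys())
    pairs = list(combinations(corpora, 2))
    if not pairs:
        raise ValueError("need at least two corpora scored at both scales")

    agree = 0
    for a, b in pairs:
        small_diff = small_scores[a] - small_scores[b]
        large_diff = large_scores[a] - large_scores[b]
        # Same sign means the small-scale comparison picked the same winner.
        if small_diff * large_diff > 0:
            agree += 1
    return agree / len(pairs)


# Made-up benchmark scores (higher is better), for illustration only.
small = {"corpus_a": 0.31, "corpus_b": 0.28, "corpus_c": 0.35, "corpus_d": 0.30}
large = {"corpus_a": 0.42, "corpus_b": 0.40, "corpus_c": 0.47, "corpus_d": 0.44}

accuracy = decision_accuracy(small, large)
print(f"decision accuracy: {accuracy:.3f}")
```

In this toy data, five of the six pairs agree across scales; only the `corpus_a` versus `corpus_d` comparison flips. The same function could take a continuous likelihood-based score at the small scale in place of accuracy, which is the kind of proxy metric the abstract describes.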
