CoLoR-Filter: Conditional Loss Reduction Filtering for Targeted Language Model Pre-training

Abstract

Selecting high-quality data for pre-training is crucial in shaping the downstream task performance of language models. A major challenge lies in identifying this optimal subset, a problem generally considered intractable, which makes scalable and effective heuristics necessary. In this work, we propose a data selection method, CoLoR-Filter (Conditional Loss Reduction Filtering), which leverages an empirical Bayes-inspired approach to derive a simple and computationally efficient selection criterion based on the relative loss values of two auxiliary models. In addition to the modeling rationale, we evaluate CoLoR-Filter empirically on two language modeling tasks: (1) selecting data from C4 for domain adaptation, evaluated on Books, and (2) selecting data from C4 for a suite of downstream multiple-choice question answering tasks. We demonstrate favorable scaling both as we subselect more aggressively and when we use small auxiliary models to select data for large target models. As one headline result, CoLoR-Filter data selected using a pair of 150M-parameter auxiliary models can train a 1.2B-parameter target model to match a 1.2B-parameter model trained on 25B randomly selected tokens, using 25x less data for Books and 11x less data for the downstream tasks.

Code: https://github.com/davidbrandfonbrener/color-filter-olmo
Filtered data: https://huggingface.co/datasets/davidbrandfonbrener/color-filtered-c4
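
The abstract describes the selection criterion only as being based on the relative loss values of two auxiliary models and does not give the exact formula. The following is a minimal sketch under stated assumptions: one auxiliary model (called `prior` here) is trained without the target data, the other (called `conditional` here) has additionally seen target data, each candidate sequence is scored by how much the second model reduces its loss, and the highest-scoring candidates are kept. The function names, the variable names, and the loss values are hypothetical and are not taken from the released code.

```python
from typing import List, Sequence


def loss_reduction_scores(prior_losses: Sequence[float],
                          conditional_losses: Sequence[float]) -> List[float]:
    """Score each candidate by how much the conditional model lowers its loss.

    Assumption: a larger reduction means the candidate is more relevant
    to the target data.
    """
    if len(prior_losses) != len(conditional_losses):
        raise ValueError("both models must score the same candidates")
    return [p - c for p, c in zip(prior_losses, conditional_losses)]


def select_top_k(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring candidates, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Hypothetical per-sequence losses for four candidate sequences.
prior = [3.0, 2.5, 4.0, 3.2]
conditional = [2.0, 2.6, 3.9, 2.0]

scores = loss_reduction_scores(prior, conditional)
selected = select_top_k(scores, k=2)
print(selected)  # indices of the two candidates with the largest loss reduction
```

Because scoring needs only one forward pass per auxiliary model and candidate, a criterion of this shape would stay cheap even for a large candidate pool, which is consistent with the abstract's emphasis on computational efficiency.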
