CoLoR-Filter: Conditional Loss Reduction Filtering for Targeted Language Model Pre-training
June 15, 2024
Authors: David Brandfonbrener, Hanlin Zhang, Andreas Kirsch, Jonathan Richard Schwarz, Sham Kakade
cs.AI
Abstract
Selecting high-quality data for pre-training is crucial in shaping the
downstream task performance of language models. A major challenge lies in
identifying this optimal subset, a problem generally considered intractable,
thus necessitating scalable and effective heuristics. In this work, we propose
a data selection method, CoLoR-Filter (Conditional Loss Reduction Filtering),
which leverages an empirical Bayes-inspired approach to derive a simple and
computationally efficient selection criterion based on the relative loss values
of two auxiliary models.
In addition to the modeling rationale, we evaluate CoLoR-Filter empirically
on two language modeling tasks: (1) selecting data from C4 for domain
adaptation, evaluated on Books, and (2) selecting data from C4 for a suite of
downstream multiple-choice question answering tasks. We demonstrate favorable
scaling both as we subselect more aggressively and as we use small auxiliary
models to select data for large target models. As one headline result, CoLoR-Filter
data selected using a pair of 150m parameter auxiliary models can train a 1.2b
parameter target model to match a 1.2b parameter model trained on 25b randomly
selected tokens with 25x less data for Books and 11x less data for the
downstream tasks.
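
The selection criterion is simple enough to sketch. Below is a minimal, hypothetical PyTorch rendering of the idea described in the abstract, assuming a generic causal-LM interface: score each candidate sequence by the loss reduction between a "prior" auxiliary model and the same model fine-tuned on downstream data, then keep the top-scoring fraction. The function names and model interface here are illustrative assumptions, not the authors' implementation; see the linked repository for that.

```python
# Minimal sketch of the CoLoR-Filter selection rule (illustrative only).
# Assumes `model(input_ids)` returns logits of shape [batch, seq_len, vocab].
import torch
import torch.nn.functional as F


def sequence_nll(model, input_ids):
    """Total negative log-likelihood of each sequence in a batch."""
    with torch.no_grad():
        logits = model(input_ids)
    # Predict token t from tokens < t (standard causal-LM shift).
    nll = F.cross_entropy(
        logits[:, :-1].transpose(1, 2),  # [batch, vocab, seq_len - 1]
        input_ids[:, 1:],                # [batch, seq_len - 1]
        reduction="none",
    )
    return nll.sum(dim=-1)  # [batch]


def color_scores(prior_model, conditional_model, input_ids):
    """Conditional loss reduction per candidate sequence.

    `prior_model` is a small auxiliary LM trained on generic data;
    `conditional_model` is the same model further fine-tuned on
    downstream (target) data. A large positive score means conditioning
    on downstream data made the sequence much more likely.
    """
    return sequence_nll(prior_model, input_ids) - sequence_nll(
        conditional_model, input_ids
    )


def select_top_fraction(scores, fraction=0.25):
    """Indices of the top-scoring fraction of candidate sequences."""
    k = max(1, int(fraction * scores.numel()))
    return torch.topk(scores, k).indices
```

Under this framing, more aggressive subselection simply corresponds to a smaller `fraction`, and the auxiliary models scoring the data can be far smaller than the target model eventually trained on the selected tokens.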
Code: https://github.com/davidbrandfonbrener/color-filter-olmo
Filtered data: https://huggingface.co/datasets/davidbrandfonbrener/color-filtered-c4