

CoLoR-Filter: Conditional Loss Reduction Filtering for Targeted Language Model Pre-training

June 15, 2024
Authors: David Brandfonbrener, Hanlin Zhang, Andreas Kirsch, Jonathan Richard Schwarz, Sham Kakade
cs.AI

Abstract

Selecting high-quality data for pre-training is crucial in shaping the downstream task performance of language models. A major challenge lies in identifying this optimal subset, a problem generally considered intractable, thus necessitating scalable and effective heuristics. In this work, we propose a data selection method, CoLoR-Filter (Conditional Loss Reduction Filtering), which leverages an empirical Bayes-inspired approach to derive a simple and computationally efficient selection criterion based on the relative loss values of two auxiliary models. In addition to the modeling rationale, we evaluate CoLoR-Filter empirically on two language modeling tasks: (1) selecting data from C4 for domain adaptation to evaluation on Books and (2) selecting data from C4 for a suite of downstream multiple-choice question answering tasks. We demonstrate favorable scaling both as we subselect more aggressively and when using small auxiliary models to select data for large target models. As one headline result, CoLoR-Filter data selected using a pair of 150m parameter auxiliary models can train a 1.2b parameter target model to match a 1.2b parameter model trained on 25b randomly selected tokens, with 25x less data for Books and 11x less data for the downstream tasks.

Code: https://github.com/davidbrandfonbrener/color-filter-olmo
Filtered data: https://huggingface.co/datasets/davidbrandfonbrener/color-filtered-c4
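To make the selection criterion concrete, the sketch below scores each candidate sequence by its conditional loss reduction: the per-sequence loss under a "prior" auxiliary model minus the loss under a "conditional" auxiliary model that has additionally seen downstream (target) data, keeping the sequences where conditioning helps most. This is a minimal illustration under stated assumptions, not the authors' implementation (see the linked repository for that); it assumes Hugging Face-style causal language models whose forward pass returns `.logits`, and the helper names are hypothetical.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sequence_loss(model, token_ids: torch.Tensor) -> torch.Tensor:
    """Mean next-token cross-entropy per sequence. token_ids: (B, T)."""
    logits = model(token_ids[:, :-1]).logits        # (B, T-1, V)
    targets = token_ids[:, 1:]                      # (B, T-1)
    per_token = F.cross_entropy(
        logits.transpose(1, 2), targets, reduction="none"
    )                                               # (B, T-1)
    return per_token.mean(dim=-1)                   # (B,)

@torch.no_grad()
def color_scores(prior_model, conditional_model, token_ids: torch.Tensor):
    """CoLoR score: loss reduction from conditioning on downstream data.

    A higher score means the downstream-conditioned auxiliary model assigns
    the sequence a lower loss than the prior model, i.e. the sequence looks
    more relevant to the target distribution.
    """
    return sequence_loss(prior_model, token_ids) - sequence_loss(
        conditional_model, token_ids
    )

def select_top_fraction(scores: torch.Tensor, fraction: float) -> torch.Tensor:
    """Indices of the top `fraction` of candidates by score (the kept subset)."""
    k = max(1, int(fraction * scores.numel()))
    return torch.topk(scores, k).indices
```

Because both auxiliary models are small (e.g., 150m parameters) and scoring requires only one forward pass per model per sequence, the criterion is cheap to apply at scale and trivially parallel across the candidate pool.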
