

Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models

May 30, 2024
作者: Zachary Ankner, Cody Blakeney, Kartik Sreenivasan, Max Marion, Matthew L. Leavitt, Mansheej Paul
cs.AI

Abstract

In this work, we investigate whether small language models can determine high-quality subsets of large-scale text datasets that improve the performance of larger language models. While existing work has shown that pruning based on the perplexity of a larger model can yield high-quality data, we investigate whether smaller models can be used for perplexity-based pruning and how pruning is affected by the domain composition of the data being pruned. We demonstrate that for multiple dataset compositions, perplexity-based pruning of pretraining data can significantly improve downstream task performance: pruning based on perplexities computed with a 125 million parameter model improves the average performance on downstream tasks of a 3 billion parameter model by up to 2.04 and achieves up to a 1.45× reduction in pretraining steps to reach commensurate baseline performance. Furthermore, we demonstrate that such perplexity-based data pruning also yields downstream performance gains in the over-trained and data-constrained regimes.
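The core procedure the abstract describes can be sketched as follows: score every pretraining document with the perplexity of a small reference model, then keep only the selected fraction of the corpus. This is a minimal illustrative sketch, not the authors' code; the function names, the document representation (precomputed per-token log-probabilities), and the choice of which end of the perplexity distribution to keep are all assumptions for the example.

```python
import math

def perplexity(token_log_probs):
    # Perplexity = exp(mean negative log-likelihood of the tokens),
    # as produced by the small reference model.
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def prune_by_perplexity(docs, score_fn, keep_fraction=0.5, keep="low"):
    """Score each document with the (small) reference model and keep
    the requested fraction of the corpus.

    keep="low"  keeps the lowest-perplexity documents;
    keep="high" keeps the highest-perplexity ones. Which selection
    criterion works best is an empirical question studied in the paper.
    """
    scored = sorted(((score_fn(d), d) for d in docs), key=lambda x: x[0])
    k = max(1, int(len(scored) * keep_fraction))
    kept = scored[:k] if keep == "low" else scored[len(scored) - k:]
    return [doc for _, doc in kept]

# Toy usage: two "documents" carrying per-token log-probs from a
# hypothetical 125M-parameter reference model.
docs = [
    {"id": "a", "lp": [-0.1, -0.2, -0.1]},   # low perplexity
    {"id": "b", "lp": [-2.5, -3.0, -2.8]},   # high perplexity
]
pruned = prune_by_perplexity(docs, lambda d: perplexity(d["lp"]),
                             keep_fraction=0.5, keep="low")
```

In practice the scoring pass would run the small model over each document to obtain token log-probabilities, and the pruned subset would then be used to pretrain the larger (e.g., 3B-parameter) model.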
