Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models
May 30, 2024
作者: Zachary Ankner, Cody Blakeney, Kartik Sreenivasan, Max Marion, Matthew L. Leavitt, Mansheej Paul
cs.AI
Abstract
In this work, we investigate whether small language models can determine
high-quality subsets of large-scale text datasets that improve the performance
of larger language models. While existing work has shown that pruning based on
the perplexity of a larger model can yield high-quality data, we investigate
whether smaller models can be used for perplexity-based pruning and how pruning
is affected by the domain composition of the data being pruned. We demonstrate
that for multiple dataset compositions, perplexity-based pruning of pretraining
data can significantly improve downstream task performance: pruning
based on perplexities computed with a 125 million parameter model improves the
average performance on downstream tasks of a 3 billion parameter model by up to
2.04 and achieves up to a 1.45× reduction in pretraining steps to reach
commensurate baseline performance. Furthermore, we demonstrate that such
perplexity-based data pruning also yields downstream performance gains in the
over-trained and data-constrained regimes.
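The pruning procedure the abstract describes — scoring documents with a small reference model's perplexity and keeping a subset — can be sketched minimally as below. This is an illustration, not the paper's implementation: the per-token log-probabilities are assumed to come from the small (e.g. 125M-parameter) reference model, and keeping the lowest-perplexity fraction is just one possible selection criterion.

```python
import math

def perplexity(token_logprobs):
    """Perplexity of one document: exp of the mean negative
    log-likelihood per token under the reference model."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def prune_by_perplexity(corpus_logprobs, keep_fraction=0.5):
    """Score each document with the small reference model's perplexity
    and keep the keep_fraction of documents scoring lowest.
    (The choice of which perplexity range to keep is an assumption here;
    the paper studies how this choice interacts with data composition.)"""
    ranked = sorted(range(len(corpus_logprobs)),
                    key=lambda i: perplexity(corpus_logprobs[i]))
    k = int(len(ranked) * keep_fraction)
    return sorted(ranked[:k])  # indices of documents to retain
```

The retained indices would then select the pretraining subset for the larger (e.g. 3B-parameter) model.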