Large-Scale Data Selection for Instruction Tuning

Abstract

Selecting high-quality training data from a larger pool is a crucial step when instruction-tuning language models, as carefully curated datasets often produce models that outperform those trained on much larger, noisier datasets. Automated data selection approaches for instruction tuning are typically tested by selecting small datasets (roughly 10k samples) from small pools (100-200k samples). However, popular deployed instruction-tuned models often train on hundreds of thousands to millions of samples, subsampled from even larger data pools. We present a systematic study of how well data selection methods scale to these settings, selecting up to 2.5M samples from pools of up to 5.8M samples and evaluating across 7 diverse tasks. We show that many recently proposed methods fall short of random selection in this setting while using more compute, and that their performance even declines when they are given larger pools of data to select from. However, we find that a variant of representation-based data selection (RDS+), which uses weighted mean pooling of pretrained LM hidden states, consistently outperforms more complex methods across all settings tested, while also being more compute-efficient. Our findings highlight that the scaling properties of proposed automated selection methods should be more closely examined. We release our code, data, and models at https://github.com/hamishivi/automated-instruction-selection.
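
To make the idea of representation-based selection more concrete, the following is a minimal sketch of weighted mean pooling followed by similarity-based selection. It is not the released implementation. The linearly increasing position weights, the use of cosine similarity against a small set of query examples, and the function names `weighted_mean_pool` and `select_top_k` are all assumptions made for illustration, and the hidden states are assumed to have been extracted from a pretrained LM beforehand.

```python
import numpy as np


def weighted_mean_pool(hidden_states, attention_mask):
    """Pool per-token hidden states into a single embedding.

    hidden_states: array of shape (seq_len, dim)
    attention_mask: array of shape (seq_len,), 1 for real tokens, 0 for padding
    """
    seq_len = hidden_states.shape[0]
    # Assumption: weights grow linearly with token position (1, 2, ..., seq_len).
    # Padding positions receive zero weight through the mask.
    weights = np.arange(1, seq_len + 1, dtype=np.float64) * attention_mask
    weights = weights / weights.sum()
    return weights @ hidden_states


def select_top_k(pool_embeddings, query_embeddings, k):
    """Return indices of the k pool items most similar to any query item.

    Assumption: each pool item is scored by its maximum cosine similarity
    to the query embeddings; other aggregation rules are possible.
    """
    pool = pool_embeddings / np.linalg.norm(pool_embeddings, axis=1, keepdims=True)
    query = query_embeddings / np.linalg.norm(query_embeddings, axis=1, keepdims=True)
    sims = pool @ query.T          # shape (n_pool, n_query)
    scores = sims.max(axis=1)      # best match per pool item
    return np.argsort(-scores)[:k]
```

In this sketch, each sample in the pool would be embedded once with `weighted_mean_pool`, after which selection reduces to a matrix product and a sort. That gives a rough sense of why an approach of this kind can be cheaper than methods requiring per-sample gradient or model-based scoring.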
