Efficient Data Selection at Scale via Influence Distillation
May 25, 2025
作者: Mahdi Nikdan, Vincent Cohen-Addad, Dan Alistarh, Vahab Mirrokni
cs.AI
Abstract
Effective data selection is critical for efficient training of modern Large
Language Models (LLMs). This paper introduces Influence Distillation, a novel,
mathematically-justified framework for data selection that employs second-order
information to optimally weight training samples. By distilling each sample's
influence on a target distribution, our method assigns model-specific weights
that are used to select training data for LLM fine-tuning, guiding it toward
strong performance on the target domain. We derive these optimal weights for
both Gradient Descent and Adam optimizers. To ensure scalability and reduce
computational cost, we propose a landmark-based approximation:
influence is precisely computed for a small subset of "landmark" samples and
then efficiently propagated to all other samples to determine their weights. We
validate Influence Distillation by applying it to instruction tuning on the
Tulu V2 dataset, targeting a range of tasks including GSM8k, SQuAD, and MMLU,
across several models from the Llama and Qwen families. Experiments show that
Influence Distillation matches or outperforms state-of-the-art performance
while achieving up to 3.5× faster selection.
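The landmark-based approximation lends itself to a short sketch. The snippet below is a minimal, illustrative Python/NumPy example assuming per-sample gradient (or embedding) features, a simple first-order inner-product influence score on the landmarks, and RBF kernel regression to propagate weights to the remaining samples; these particular choices (function names, kernel, first-order scoring) are assumptions for illustration only and are not the paper's exact second-order, optimizer-specific derivation.

```python
# Illustrative sketch of landmark-based influence propagation.
# All names and modeling choices here are assumptions, not the paper's method.
import numpy as np

def landmark_influence_weights(train_feats, target_vec, num_landmarks=64, seed=0):
    """Approximate per-sample influence weights.

    train_feats : (N, d) array of per-sample gradient/embedding features
    target_vec  : (d,) mean feature of the target distribution
    Influence is computed only for a small landmark subset (here via a simple
    inner product) and propagated to all samples with kernel regression.
    """
    rng = np.random.default_rng(seed)
    n = train_feats.shape[0]
    landmarks = rng.choice(n, size=min(num_landmarks, n), replace=False)

    # "Exact" influence on landmarks: first-order alignment with the target.
    landmark_influence = train_feats[landmarks] @ target_vec  # shape (L,)

    def rbf(a, b, gamma=1e-3):
        # Pairwise RBF kernel between rows of a and rows of b.
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)

    k_nl = rbf(train_feats, train_feats[landmarks])              # (N, L)
    k_ll = rbf(train_feats[landmarks], train_feats[landmarks])   # (L, L)

    # Propagate landmark influences to every sample (ridge-regularized solve).
    weights = k_nl @ np.linalg.solve(
        k_ll + 1e-6 * np.eye(len(landmarks)), landmark_influence
    )
    return weights

if __name__ == "__main__":
    # Toy usage: rank samples and keep the top-k as the selected subset.
    rng = np.random.default_rng(1)
    feats = rng.standard_normal((1000, 32))
    target = rng.standard_normal(32)
    w = landmark_influence_weights(feats, target)
    selected = np.argsort(-w)[:100]
```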