Efficient Data Selection at Scale via Influence Distillation
May 25, 2025
Authors: Mahdi Nikdan, Vincent Cohen-Addad, Dan Alistarh, Vahab Mirrokni
cs.AI
Abstract
Effective data selection is critical for efficient training of modern Large
Language Models (LLMs). This paper introduces Influence Distillation, a novel,
mathematically-justified framework for data selection that employs second-order
information to optimally weight training samples. By distilling each sample's
influence on a target distribution, our method assigns model-specific weights
that are used to select training data for LLM fine-tuning, guiding it toward
strong performance on the target domain. We derive these optimal weights for
both Gradient Descent and Adam optimizers. To ensure scalability and reduce
computational cost, we propose a landmark-based approximation:
influence is precisely computed for a small subset of "landmark" samples and
then efficiently propagated to all other samples to determine their weights. We
validate Influence Distillation by applying it to instruction tuning on the
Tulu V2 dataset, targeting a range of tasks including GSM8k, SQuAD, and MMLU,
across several models from the Llama and Qwen families. Experiments show that
Influence Distillation matches or outperforms state-of-the-art performance
while achieving up to 3.5× faster selection.
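The following Python snippet is a minimal sketch, for intuition only, of the landmark-based approximation described above; it is not the paper's algorithm. It assumes (simplifications not taken from the abstract) that a sample's influence can be approximated first-order as the alignment of its gradient features with a mean target gradient, and that landmark influence is propagated to the remaining samples by plain RBF kernel regression, whereas the paper derives second-order, optimizer-specific weights (for Gradient Descent and Adam).

```python
# Hedged sketch of a landmark-style influence approximation; the feature
# shapes, the first-order influence proxy, and the RBF propagation are all
# illustrative assumptions, not the paper's method.
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: per-sample "gradient features" and a mean target gradient.
n_train, n_landmarks, d = 10_000, 256, 64
train_feats = rng.standard_normal((n_train, d))
target_grad = rng.standard_normal(d)

# 1) Compute influence "exactly" only for a small landmark subset.
#    Here: first-order alignment with the target gradient.
landmark_idx = rng.choice(n_train, size=n_landmarks, replace=False)
landmark_feats = train_feats[landmark_idx]
landmark_influence = landmark_feats @ target_grad            # (n_landmarks,)

# 2) Propagate landmark influence to all samples via RBF kernel regression
#    over a cheap similarity space (an assumption).
def rbf(a, b, gamma=1.0 / d):
    sq = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

K_ll = rbf(landmark_feats, landmark_feats) + 1e-6 * np.eye(n_landmarks)
K_al = rbf(train_feats, landmark_feats)
alpha = np.linalg.solve(K_ll, landmark_influence)
approx_influence = K_al @ alpha                              # (n_train,)

# 3) Keep the top-weighted samples as the fine-tuning subset.
budget = 1_000
selected_idx = np.argsort(-approx_influence)[:budget]
print(selected_idx[:10])
```

The cost structure this sketch mirrors is the key point: exact influence is computed for only n_landmarks samples, and every other sample's weight comes from a cheap similarity computation, which is what makes the selection scalable.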