Efficient Data Selection at Scale via Influence Distillation

Abstract

Effective data selection is critical for efficient training of modern Large Language Models (LLMs). This paper introduces Influence Distillation, a novel, mathematically justified framework for data selection that uses second-order information to optimally weight training samples. By distilling each sample's influence on a target distribution, our method assigns model-specific weights that are used to select training data for LLM fine-tuning, guiding it toward strong performance on the target domain. We derive these optimal weights for both the Gradient Descent and Adam optimizers. To ensure scalability and reduce computational cost, we propose a landmark-based approximation: influence is computed exactly for a small subset of "landmark" samples and then efficiently propagated to all other samples to determine their weights. We validate Influence Distillation by applying it to instruction tuning on the Tulu V2 dataset, targeting a range of tasks including GSM8k, SQuAD, and MMLU, across several models from the Llama and Qwen families. Experiments show that Influence Distillation matches or exceeds state-of-the-art performance while achieving up to 3.5× faster selection.
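
The landmark-based approximation described in the abstract can be illustrated with a small sketch. The abstract does not specify how influence is propagated from landmarks to the remaining samples, so the rule below is an assumption made purely for illustration: each sample is assumed to have a cheap embedding, every embedding is written as a least-squares combination of the landmark embeddings, and the same coefficients are applied to the landmark influence scores. The names `propagate_influence` and `select_top_k` are hypothetical, and the exact landmark influences are taken as given inputs rather than computed from a model.

```python
import numpy as np


def propagate_influence(embeddings, landmark_idx, landmark_influence):
    """Estimate influence for all samples from a few landmark scores.

    Assumption (not from the paper): each embedding is approximated as a
    least-squares combination of landmark embeddings, and the same
    coefficients are applied to the landmark influence values.
    """
    E = np.asarray(embeddings, dtype=float)        # shape (n, d)
    E_L = E[landmark_idx]                          # shape (l, d)
    # Solve E_L.T @ X ~= E.T for X with shape (l, n).
    coeffs, *_ = np.linalg.lstsq(E_L.T, E.T, rcond=None)
    scores_L = np.asarray(landmark_influence, dtype=float)  # shape (l,)
    return coeffs.T @ scores_L                     # shape (n,)


def select_top_k(scores, k):
    """Return the indices of the k highest-scoring samples, best first."""
    order = np.argsort(-np.asarray(scores), kind="stable")
    return order[:k]


if __name__ == "__main__":
    # Toy data: the first three samples serve as landmarks.
    E = np.array([
        [1.0, 0.0, 0.0],
        [0.0, 1.0, 0.0],
        [0.0, 0.0, 1.0],
        [1.0, 1.0, 0.0],
        [0.0, 2.0, 1.0],
        [3.0, 0.0, -1.0],
    ])
    landmarks = [0, 1, 2]
    landmark_scores = [1.0, -2.0, 0.5]  # stand-in for exactly computed influence
    estimated = propagate_influence(E, landmarks, landmark_scores)
    print(select_top_k(estimated, k=2))
```

In this sketch the cost of the exact influence computation scales with the number of landmarks, while every other sample only needs an embedding and a small linear solve. Whether that trade-off matches the method in the paper depends on details the abstract does not give.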
