Efficient Data Selection at Scale via Influence Distillation

Abstract

Effective data selection is critical for efficient training of modern Large Language Models (LLMs). This paper introduces Influence Distillation, a novel, mathematically justified framework for data selection that employs second-order information to optimally weight training samples. By distilling each sample's influence on a target distribution, the method assigns model-specific weights that are used to select training data for LLM fine-tuning, guiding it toward strong performance on the target domain. The authors derive these optimal weights for both the Gradient Descent and Adam optimizers. To ensure scalability and reduce computational cost, they propose a landmark-based approximation: influence is computed precisely for a small subset of "landmark" samples and then efficiently propagated to all other samples to determine their weights. Influence Distillation is validated by applying it to instruction tuning on the Tulu V2 dataset, targeting a range of tasks including GSM8k, SQuAD, and MMLU, across several models from the Llama and Qwen families. Experiments show that Influence Distillation matches or outperforms state-of-the-art performance while achieving up to 3.5× faster selection.
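The landmark idea can be illustrated with a toy sketch. The abstract does not say how influence is computed or how it is propagated, so everything below is an assumption made for illustration: influence on the target is replaced by a simple first-order stand-in (a gradient dot product, not the second-order weights the paper derives), and propagation is done by a distance-weighted average over landmark scores in a hypothetical embedding space. The function names `landmark_influence`, `propagate`, and `select_top_k` are invented here and are not taken from the paper.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def landmark_influence(landmark_grads, target_grad):
    # Assumed first-order stand-in for influence: alignment between each
    # landmark's gradient and the gradient of the target loss.
    return [dot(g, target_grad) for g in landmark_grads]


def propagate(pool_embeddings, landmark_embeddings, landmark_scores, temperature=1.0):
    # Assumed propagation rule: each pool sample receives a softmax-weighted
    # average of landmark scores, weighted by squared embedding distance.
    scores = []
    for e in pool_embeddings:
        d2 = [sum((a - b) ** 2 for a, b in zip(e, l)) for l in landmark_embeddings]
        shift = min(d2)  # subtract the minimum for numerical stability
        w = [math.exp(-(d - shift) / temperature) for d in d2]
        total = sum(w)
        scores.append(sum(wi * si for wi, si in zip(w, landmark_scores)) / total)
    return scores


def select_top_k(scores, k):
    # Indices of the k highest-scoring samples.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Two landmarks with exactly computed scores (toy values).
landmark_embeddings = [[0.0, 0.0], [1.0, 1.0]]
landmark_grads = [[1.0, 0.0], [0.0, -1.0]]
target_grad = [1.0, 1.0]
landmark_scores = landmark_influence(landmark_grads, target_grad)

# Three pool samples whose scores are estimated cheaply from the landmarks.
pool_embeddings = [[0.1, 0.0], [0.9, 1.0], [0.5, 0.5]]
pool_scores = propagate(pool_embeddings, landmark_embeddings, landmark_scores)
selected = select_top_k(pool_scores, k=1)
```

In this toy setup the expensive step (here, the gradient dot product) runs only on the two landmarks, while the pool samples need only embedding distances. That is the general cost trade-off the abstract describes, though the paper's actual computation may differ substantially.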

