Efficient Data Selection at Scale via Influence Distillation

Abstract

Effective data selection is critical for efficient training of modern Large Language Models (LLMs). This paper introduces Influence Distillation, a novel, mathematically justified framework for data selection that uses second-order information to optimally weight training samples. By distilling each sample's influence on a target distribution, our method assigns model-specific weights that are used to select training data for LLM fine-tuning, guiding the model toward strong performance on the target domain. We derive these optimal weights for both Gradient Descent and Adam optimizers. To ensure scalability and reduce computational cost, we propose a landmark-based approximation: influence is computed exactly for a small subset of "landmark" samples and then efficiently propagated to all other samples to determine their weights. We validate Influence Distillation by applying it to instruction tuning on the Tulu V2 dataset, targeting a range of tasks including GSM8k, SQuAD, and MMLU, across several models from the Llama and Qwen families. Experiments show that Influence Distillation matches or outperforms state-of-the-art performance while achieving up to 3.5 times faster selection.

