Efficient Data Selection at Scale via Influence Distillation

Abstract

Effective data selection is critical for efficient training of modern Large Language Models (LLMs). This paper introduces Influence Distillation, a novel, mathematically justified framework for data selection that uses second-order information to optimally weight training samples. By distilling each sample's influence on a target distribution, the method assigns model-specific weights that are used to select training data for LLM fine-tuning, guiding it toward strong performance on the target domain. We derive these optimal weights for both the Gradient Descent and Adam optimizers. To ensure scalability and reduce computational cost, we propose a landmark-based approximation: influence is computed exactly for a small subset of "landmark" samples and then efficiently propagated to all other samples to determine their weights. We validate Influence Distillation by applying it to instruction tuning on the Tulu V2 dataset, targeting a range of tasks including GSM8k, SQuAD, and MMLU, across several models from the Llama and Qwen families. Experiments show that Influence Distillation matches or exceeds state-of-the-art performance while achieving up to 3.5× faster selection.
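
The landmark-based approximation described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how influence is propagated, so everything below is an assumption made for illustration: each sample is assumed to have a cheap embedding vector, each sample's embedding is expressed as a least-squares combination of the landmark embeddings, and the same coefficients are reused to combine the exactly computed landmark influences. The function names `propagate_landmark_influence` and `select_top_k` are hypothetical and do not come from the paper.

```python
import numpy as np


def propagate_landmark_influence(embeddings, landmark_idx, landmark_influence):
    """Estimate an influence score for every sample from a few landmarks.

    embeddings:         (n, d) cheap per-sample feature vectors (assumed input).
    landmark_idx:       (k,) indices of the landmark samples.
    landmark_influence: (k,) influence scores computed exactly for the landmarks.

    Assumption: influence is propagated with the same linear coefficients
    that reconstruct each embedding from the landmark embeddings.
    """
    E = np.asarray(embeddings, dtype=float)
    L = E[np.asarray(landmark_idx)]  # (k, d) landmark embeddings
    # Solve L.T @ X ≈ E.T in the least-squares sense; X has shape (k, n).
    # Column i of X holds the landmark coefficients for sample i.
    X, *_ = np.linalg.lstsq(L.T, E.T, rcond=None)
    return X.T @ np.asarray(landmark_influence, dtype=float)  # (n,)


def select_top_k(scores, budget):
    """Return the indices of the `budget` highest-scoring samples."""
    return np.argsort(-np.asarray(scores))[:budget]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    E = rng.normal(size=(1000, 16))       # toy embeddings
    landmarks = rng.choice(1000, size=32, replace=False)
    exact = rng.normal(size=32)           # stand-in for exact landmark influence
    scores = propagate_landmark_influence(E, landmarks, exact)
    print(select_top_k(scores, budget=10))
```

In this sketch the expensive step (exact influence) is only needed for the landmark samples, while every other sample costs one least-squares projection, which is the kind of saving the landmark approximation is meant to provide.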

