Sparse Logit Sampling: Accelerating Knowledge Distillation in LLMs

Abstract

Knowledge distillation can be a cost-effective technique for distilling knowledge in Large Language Models, provided the teacher's output logits can be pre-computed and cached. However, successfully applying this to pre-training remains largely unexplored. In this work, we prove that naive approaches to sparse knowledge distillation, such as caching Top-K probabilities, while intuitive, give the student biased estimates of the teacher probability distribution, resulting in suboptimal performance and calibration. We propose an importance-sampling-based method, 'Random Sampling Knowledge Distillation', which provides unbiased estimates, preserves the gradient in expectation, and requires storing significantly sparser logits. Our method enables faster training of student models with marginal overhead (<10%) compared to cross-entropy-based training, while maintaining competitive performance compared to full distillation, across a range of model sizes from 300M to 3B parameters.
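The contrast between the two sparse targets can be illustrated numerically. The following is a minimal sketch, not the paper's implementation: the abstract does not specify the sampling scheme, so the sketch assumes tokens are drawn directly from the teacher distribution (every importance weight is then 1). The function names, vocabulary size, and synthetic logits are all hypothetical.

```python
import numpy as np

def topk_sparse_target(p, k):
    """Keep the k largest teacher probabilities and renormalize them."""
    q = np.zeros_like(p)
    idx = np.argsort(p)[-k:]          # indices of the k largest entries
    q[idx] = p[idx]
    return q / q.sum()                # tail mass is pushed onto the head

def sampled_sparse_target(p, n_samples, rng):
    """Draw n_samples token ids from p and store the normalized counts.

    Assumption: the proposal equals the teacher distribution p, so each
    importance weight is 1 and the expected target equals p.
    """
    ids = rng.choice(len(p), size=n_samples, p=p)
    counts = np.bincount(ids, minlength=len(p))
    return counts / n_samples

rng = np.random.default_rng(0)
vocab = 1000                          # hypothetical vocabulary size
logits = 2.0 * rng.normal(size=vocab) # synthetic teacher logits
p = np.exp(logits - logits.max())
p /= p.sum()

k = 12                                # entries stored per token position
topk = topk_sparse_target(p, k)

# Average many independent sampled targets to approximate their expectation.
trials = 2000
avg_sampled = np.mean(
    [sampled_sparse_target(p, k, rng) for _ in range(trials)], axis=0
)

topk_err = np.abs(topk - p).sum()           # does not shrink with more trials
sampled_err = np.abs(avg_sampled - p).sum() # shrinks as trials grow
print(f"L1 gap, Top-K target:        {topk_err:.3f}")
print(f"L1 gap, mean sampled target: {sampled_err:.3f}")
```

Each individual sampled target is as sparse as the Top-K one (at most k non-zero entries), but its average converges to the full teacher distribution, whereas the Top-K target keeps a fixed gap because the tail mass is never represented.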
