Rethinking Selective Knowledge Distillation
February 1, 2026
Authors: Almog Tavor, Itay Ebenspanger, Neil Cnaan, Mor Geva
cs.AI
Abstract
Growing efforts to improve knowledge distillation (KD) in large language models (LLMs) replace dense teacher supervision with selective distillation, which uses a subset of token positions, vocabulary classes, or training samples for supervision. However, it remains unclear which importance signals, selection policies, and their interplay are most effective. In this work, we revisit where and how to distill in autoregressive LLMs. We disentangle selective KD along the position, class, and sample axes and systematically compare importance signals and selection policies. Then, guided by this analysis, we identify underexplored opportunities and introduce student-entropy-guided position selection (SE-KD). Across a suite of benchmarks, SE-KD often improves accuracy, downstream task adherence, and memory efficiency over dense distillation. Extending this approach across the class and sample axes (SE-KD 3X) yields complementary efficiency gains that make offline teacher caching feasible. In practice, this reduces wall time by 70% and peak memory by 18%, while cutting storage usage by 80% over prior methods without sacrificing performance.
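To make the high-level idea concrete, below is a minimal, hypothetical sketch (in PyTorch) of what student-entropy-guided position selection could look like: the student's per-token predictive entropy serves as the importance signal, and the distillation loss is applied only at the highest-entropy positions. The `keep_ratio`, temperature, and top-k policy are illustrative assumptions, not the paper's exact configuration.

```python
# Hypothetical sketch of student-entropy-guided position selection for KD.
# Assumes top-k selection of token positions by student predictive entropy;
# the paper's actual importance signal and selection policy may differ.
import torch
import torch.nn.functional as F


def se_kd_loss(student_logits, teacher_logits, keep_ratio=0.25, temperature=2.0):
    """KL distillation loss applied only at high-student-entropy positions.

    student_logits, teacher_logits: (batch, seq_len, vocab)
    keep_ratio: fraction of token positions to supervise (illustrative assumption).
    """
    # Per-position entropy of the student's predictive distribution.
    log_p = F.log_softmax(student_logits, dim=-1)
    entropy = -(log_p.exp() * log_p).sum(dim=-1)          # (batch, seq_len)

    # Keep only the top keep_ratio fraction of positions by entropy.
    k = max(1, int(keep_ratio * entropy.size(1)))
    _, idx = entropy.topk(k, dim=1)                        # (batch, k)
    mask = torch.zeros_like(entropy, dtype=torch.bool).scatter_(1, idx, True)

    # Temperature-scaled KL divergence, restricted to the selected positions.
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    kl = (t * (t.clamp_min(1e-9).log() - s)).sum(dim=-1)   # (batch, seq_len)
    return (kl * mask).sum() / mask.sum() * temperature ** 2


if __name__ == "__main__":
    torch.manual_seed(0)
    student = torch.randn(2, 16, 100)   # toy batch: 2 sequences, 16 tokens, vocab 100
    teacher = torch.randn(2, 16, 100)
    print(se_kd_loss(student, teacher).item())
```

Because only a fraction of positions contributes to the loss, only the teacher distributions at those positions need to be stored, which is what makes the offline teacher caching described in the abstract practical.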