
Rethinking Selective Knowledge Distillation

February 1, 2026
Authors: Almog Tavor, Itay Ebenspanger, Neil Cnaan, Mor Geva
cs.AI

Abstract

Growing efforts to improve knowledge distillation (KD) in large language models (LLMs) replace dense teacher supervision with selective distillation, which uses a subset of token positions, vocabulary classes, or training samples for supervision. However, it remains unclear which importance signals, selection policies, and their interplay are most effective. In this work, we revisit where and how to distill in autoregressive LLMs. We disentangle selective KD along the position, class, and sample axes and systematically compare importance signals and selection policies. Then, guided by this analysis, we identify underexplored opportunities and introduce student-entropy-guided position selection (SE-KD). Across a suite of benchmarks, SE-KD often improves accuracy, downstream task adherence, and memory efficiency over dense distillation. Extending this approach across the class and sample axes (SE-KD 3X) yields complementary efficiency gains that make offline teacher caching feasible. In practice, this reduces wall time by 70% and peak memory by 18%, while cutting storage usage by 80% over prior methods without sacrificing performance.
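The abstract does not spell out the selection rule in detail. The sketch below illustrates what student-entropy-guided position selection could look like in PyTorch, assuming a forward-KL distillation loss and top-k selection of the highest-entropy student positions; the function name `se_kd_loss`, the `keep_ratio` and `temperature` parameters, and the choice of KL direction are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def se_kd_loss(student_logits, teacher_logits, keep_ratio=0.5, temperature=1.0):
    """Hypothetical sketch of student-entropy-guided position selection.

    student_logits, teacher_logits: tensors of shape (batch, seq_len, vocab).
    Positions where the student is most uncertain (highest predictive entropy)
    receive the distillation loss; the remaining positions are skipped.
    """
    with torch.no_grad():
        # Per-position entropy of the student's predictive distribution.
        student_probs = F.softmax(student_logits / temperature, dim=-1)
        entropy = -(student_probs * student_probs.clamp_min(1e-9).log()).sum(-1)  # (B, T)

        # Keep the top keep_ratio fraction of positions per sequence.
        k = max(1, int(keep_ratio * entropy.shape[-1]))
        _, idx = entropy.topk(k, dim=-1)
        mask = torch.zeros_like(entropy, dtype=torch.bool).scatter_(-1, idx, True)

    # Forward KL from teacher to student, computed only on selected positions.
    s_logp = F.log_softmax(student_logits / temperature, dim=-1)
    t_prob = F.softmax(teacher_logits / temperature, dim=-1)
    kl = (t_prob * (t_prob.clamp_min(1e-9).log() - s_logp)).sum(-1)  # (B, T)
    return (kl * mask).sum() / mask.sum().clamp_min(1)
```

Because the selection mask depends only on the student, positions can be chosen before querying (or loading) teacher outputs, which is what makes caching teacher logits for only the selected positions, and hence the reported storage savings, plausible in an offline setup.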