Memorization Dynamics in Knowledge Distillation for Language Models
January 21, 2026
Authors: Jaydeep Borkar, Karan Chadha, Niloofar Mireshghallah, Yuchen Zhang, Irina-Elena Veliche, Archi Mitra, David A. Smith, Zheng Xu, Diego Garcia-Olano
cs.AI
Abstract
Knowledge Distillation (KD) is increasingly adopted to transfer capabilities from large language models to smaller ones, offering significant improvements in efficiency and utility while often surpassing standard fine-tuning. Beyond performance, KD is also explored as a privacy-preserving mechanism to mitigate the risk of training data leakage. While training data memorization has been extensively studied in standard pre-training and fine-tuning settings, its dynamics in a knowledge distillation setup remain poorly understood. In this work, we study memorization across the KD pipeline using three large language model (LLM) families (Pythia, OLMo-2, Qwen-3) and three datasets (FineWeb, Wikitext, Nemotron-CC-v2). We find: (1) distilled models memorize significantly less training data than standard fine-tuning (reducing memorization by more than 50%); (2) some examples are inherently easier to memorize and account for a large fraction of memorization during distillation (over ~95%); (3) student memorization is predictable prior to distillation using features based on zlib entropy, KL divergence, and perplexity; and (4) while soft and hard distillation have similar overall memorization rates, hard distillation poses a greater risk: it inherits 2.7 times more teacher-specific examples than soft distillation. Overall, we demonstrate that distillation can provide both improved generalization and reduced memorization risks compared to standard fine-tuning.
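Finding (3) states that student memorization can be predicted before distillation from example-level features. As a rough illustration only (not the paper's implementation), the sketch below computes three such features for a single training example: zlib-compressed size, teacher perplexity, and the mean per-token teacher-student KL divergence. The function name, signatures, and the assumption that teacher and student share a tokenizer are all illustrative choices, not details from the paper.

```python
import math
import zlib

import torch
import torch.nn.functional as F


@torch.no_grad()
def memorization_features(text, teacher, student, tokenizer, device="cpu"):
    """Return (zlib_size, teacher_ppl, mean_kl) for one training example.

    Assumes teacher and student are causal LMs sharing the same tokenizer,
    so their output distributions are directly comparable.
    """
    # zlib "entropy": compressed byte length, a cheap proxy for how much
    # non-redundant content the example carries.
    zlib_size = len(zlib.compress(text.encode("utf-8")))

    enc = tokenizer(text, return_tensors="pt").to(device)
    input_ids = enc["input_ids"]

    t_logits = teacher(**enc).logits  # shape (1, T, V)
    s_logits = student(**enc).logits  # shape (1, T, V)

    # Teacher perplexity: exp of the mean next-token negative log-likelihood.
    nll = F.cross_entropy(
        t_logits[:, :-1, :].reshape(-1, t_logits.size(-1)),
        input_ids[:, 1:].reshape(-1),
    )
    teacher_ppl = math.exp(nll.item())

    # Mean per-token KL(teacher || student) over the sequence.
    t_logp = F.log_softmax(t_logits, dim=-1)
    s_logp = F.log_softmax(s_logits, dim=-1)
    mean_kl = (t_logp.exp() * (t_logp - s_logp)).sum(-1).mean().item()

    return zlib_size, teacher_ppl, mean_kl
```

One plausible use of such features, in line with the abstract's claim, is to score every candidate training example before distillation and flag the high-risk subset with a simple classifier or threshold; the exact feature set and predictor used in the paper may differ.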