Memorization Dynamics in Knowledge Distillation for Language Models

January 21, 2026
Authors: Jaydeep Borkar, Karan Chadha, Niloofar Mireshghallah, Yuchen Zhang, Irina-Elena Veliche, Archi Mitra, David A. Smith, Zheng Xu, Diego Garcia-Olano
cs.AI

Abstract

Knowledge Distillation (KD) is increasingly adopted to transfer capabilities from large language models to smaller ones, offering significant improvements in efficiency and utility while often surpassing standard fine-tuning. Beyond performance, KD is also explored as a privacy-preserving mechanism to mitigate the risk of training data leakage. While training data memorization has been extensively studied in standard pre-training and fine-tuning settings, its dynamics in a knowledge distillation setup remain poorly understood. In this work, we study memorization across the KD pipeline using three large language model (LLM) families (Pythia, OLMo-2, Qwen-3) and three datasets (FineWeb, Wikitext, Nemotron-CC-v2). We find: (1) distilled models memorize significantly less training data than standard fine-tuning (reducing memorization by more than 50%); (2) some examples are inherently easier to memorize and account for a large fraction of memorization during distillation (over ~95%); (3) student memorization is predictable prior to distillation using features based on zlib entropy, KL divergence, and perplexity; and (4) while soft and hard distillation have similar overall memorization rates, hard distillation poses a greater risk: it inherits 2.7× more teacher-specific examples than soft distillation. Overall, we demonstrate that distillation can provide both improved generalization and reduced memorization risks compared to standard fine-tuning.
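
To make finding (3) concrete, the sketch below shows one plausible way to compute the three per-example features named in the abstract (zlib entropy, perplexity, and teacher-student KL divergence) before distillation runs. This is an illustrative assumption about how such features could be extracted, not the authors' implementation: the model checkpoints, the KL direction, and the exact feature definitions are placeholders chosen for the example.

```python
# Illustrative sketch (not the paper's code): per-example features for
# predicting student memorization before distillation.
import zlib

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer


def zlib_entropy(text: str) -> float:
    """Bits of zlib-compressed text per character: a cheap redundancy proxy."""
    compressed = zlib.compress(text.encode("utf-8"))
    return 8.0 * len(compressed) / max(len(text), 1)


@torch.no_grad()
def perplexity(model, tokenizer, text: str) -> float:
    """Token-level perplexity of `text` under `model`."""
    ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    loss = model(ids, labels=ids).loss  # mean next-token cross-entropy
    return torch.exp(loss).item()


@torch.no_grad()
def mean_teacher_student_kl(teacher, student, tokenizer, text: str) -> float:
    """Mean per-token KL(teacher || student); the direction is an assumption."""
    ids = tokenizer(text, return_tensors="pt").input_ids.to(teacher.device)
    t_logp = F.log_softmax(teacher(ids).logits, dim=-1)
    s_logp = F.log_softmax(student(ids).logits, dim=-1)
    # kl_div(input, target, log_target=True) computes KL(target || input).
    kl = F.kl_div(s_logp, t_logp, log_target=True, reduction="none").sum(-1)
    return kl.mean().item()


if __name__ == "__main__":
    # Hypothetical teacher/student pair; the paper studies the Pythia,
    # OLMo-2, and Qwen-3 families.
    tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-1b")
    teacher = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-1b")
    student = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-160m")

    sample = "Example training document text goes here."
    features = {
        "zlib_entropy": zlib_entropy(sample),
        "student_ppl": perplexity(student, tok, sample),
        "teacher_student_kl": mean_teacher_student_kl(teacher, student, tok, sample),
    }
    print(features)  # such features could feed a simple classifier of memorization risk
```

Features like these are cheap to compute on the training set before any distillation step, which is what makes pre-distillation prediction of student memorization practical.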