MixSD: Mixed-Context Self-Distillation for Knowledge Injection

Abstract

Supervised fine-tuning (SFT) is widely used to inject new knowledge into language models, but it often degrades pretrained capabilities such as reasoning and general-domain performance. We argue that this forgetting arises because fine-tuning targets supplied by humans or external systems diverge from the model's own autoregressive distribution, forcing the optimizer to imitate low-probability token sequences. To address this problem, we propose MixSD, a simple method for distribution-aligned knowledge injection that requires no external teacher model. Instead of training on fixed targets, MixSD constructs supervision dynamically by mixing tokens from two conditionals of the base model itself: an expert conditional that observes the injected fact in context, and a naive conditional that reflects the model's original prior. The resulting supervision sequences preserve the factual learning signal while remaining substantially closer to the base model's original distribution. We evaluate MixSD on two synthetic corpora that we construct to study factual recall and arithmetic function acquisition in a controlled setting, as well as on established benchmarks for open-domain factual question answering and knowledge editing. Across multiple model scales and settings, MixSD consistently achieves a better memorization-retention trade-off than SFT and on-policy self-distillation baselines: it retains up to 100% of the base model's held-out capability while maintaining near-perfect training accuracy, whereas standard SFT retains as little as 1%. We further show that MixSD produces supervision targets with substantially lower negative log-likelihood (NLL) under the base model and reduces harmful movement along Fisher-sensitive parameter directions. These results suggest that aligning supervision with the model's native generation distribution is a simple and effective principle for knowledge injection that mitigates catastrophic forgetting.
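The idea of mixing tokens from an expert and a naive conditional can be illustrated with a toy sketch. The abstract does not state the mixing rule, so the code below assumes a per-token Bernoulli choice with a hypothetical rate `alpha`. The functions `naive_next` and `expert_next` are hand-written stand-ins for the base model without and with the fact in context, and the fact "Quil" and all probabilities are invented for illustration. It is not the authors' implementation.

```python
import math
import random

TEMPLATE = ["The", "answer", "is"]

def naive_next(prompt, prefix):
    """Toy stand-in for the base model WITHOUT the fact in context (assumed numbers)."""
    if len(prefix) < len(TEMPLATE):
        return {TEMPLATE[len(prefix)]: 1.0}
    if len(prefix) == len(TEMPLATE):
        return {"unknown": 0.97, "Quil": 0.03}
    return {"<eos>": 1.0}

def expert_next(prompt, prefix):
    """Toy stand-in for the same model WITH the injected fact in context."""
    if len(prefix) < len(TEMPLATE):
        return {TEMPLATE[len(prefix)]: 1.0}
    if len(prefix) == len(TEMPLATE):
        return {"Quil": 1.0}
    return {"<eos>": 1.0}

def mix_supervision(expert, naive, prompt, max_len, alpha, rng):
    """Sample each token from the expert conditional with probability alpha,
    otherwise from the naive conditional (assumed mixing rule)."""
    seq = []
    for _ in range(max_len):
        source = expert if rng.random() < alpha else naive
        dist = source(prompt, seq)
        tokens = list(dist)
        weights = [dist[t] for t in tokens]
        tok = rng.choices(tokens, weights=weights, k=1)[0]
        if tok == "<eos>":
            break
        seq.append(tok)
    return seq

def nll(next_dist, prompt, seq, floor=1e-12):
    """Negative log-likelihood of seq under a next-token function.
    Tokens outside the support get a small floor probability."""
    total = 0.0
    for i, tok in enumerate(seq):
        total -= math.log(next_dist(prompt, seq[:i]).get(tok, floor))
    return total

if __name__ == "__main__":
    rng = random.Random(0)
    mixed = mix_supervision(expert_next, naive_next, "Q?", 8, 0.5, rng)
    print("mixed target:", mixed)
    print("NLL of mixed target under naive model:", nll(naive_next, "Q?", mixed))
    print("NLL of bare target ['Quil'] under naive model:", nll(naive_next, "Q?", ["Quil"]))
```

In this toy setup, a bare answer token that ignores the model's usual phrasing receives a much higher NLL under the naive model than a sequence that follows the model's own template and differs only at the factual token. This mirrors, in miniature, the distribution-alignment argument made above.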