MixSD: Mixed Contextual Self-Distillation for Knowledge Injection

Abstract

Supervised fine-tuning (SFT) is widely used to inject new knowledge into language models, but it often degrades pretrained capabilities such as reasoning and general-domain performance. We argue that this forgetting arises because fine-tuning targets from humans or external systems diverge from the model's autoregressive distribution, forcing the optimizer to imitate low-probability token sequences. To address this problem, we propose MixSD, a simple method for distribution-aligned knowledge injection that requires no external teacher. Instead of training on fixed targets, MixSD constructs supervision dynamically by mixing tokens from two conditionals of the base model itself: an expert conditional that observes the injected fact in context, and a naive conditional that reflects the model's original prior. The resulting supervision sequences preserve the factual learning signal while remaining substantially closer to the base model's distribution. We evaluate MixSD on two synthetic corpora that we construct to study factual recall and arithmetic function acquisition in a controlled setting, together with established benchmarks for open-domain factual question answering and knowledge editing. Across multiple model scales and settings, MixSD consistently achieves a better memorization-retention trade-off than SFT and on-policy self-distillation baselines, retaining up to 100% of the base model's held-out capability while maintaining near-perfect training accuracy, whereas standard SFT retains as little as 1%. We further show that MixSD produces supervision targets with substantially lower negative log-likelihood (NLL) under the base model and reduces harmful movement along Fisher-sensitive parameter directions. These results suggest that aligning supervision with the model's native generation distribution is a simple and effective principle for knowledge injection that mitigates catastrophic forgetting.
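
The idea of mixing tokens from two conditionals can be illustrated with a toy sketch. The abstract does not state the mixing rule, so the per-token coin flip, the parameter `p_expert`, the function names, and the hand-written distributions `toy_expert` and `toy_naive` below are all assumptions made for illustration, not the authors' implementation. The sketch only shows why a mixed sequence tends to have lower NLL under the naive (base) conditional than a sequence drawn entirely from the expert conditional.

```python
import math
import random


def sample(dist, rng):
    """Draw one token from a {token: probability} dictionary."""
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


def mix_supervision(expert, naive, length, p_expert, rng):
    """Build one supervision sequence token by token.

    At each position a coin flip picks which conditional to sample from.
    p_expert is a hypothetical mixing probability (an assumption of this
    sketch, not a value or rule taken from the paper).
    """
    prefix = []
    for _ in range(length):
        source = expert if rng.random() < p_expert else naive
        prefix.append(sample(source(prefix), rng))
    return prefix


def nll_under(model, sequence):
    """Mean negative log-likelihood of a sequence under a conditional."""
    total = 0.0
    for i, token in enumerate(sequence):
        total -= math.log(model(sequence[:i])[token])
    return total / len(sequence)


def toy_naive(prefix):
    # Stand-in for the base model without the fact in context (its prior).
    # The prefix is ignored to keep the toy small.
    return {"the": 0.5, "answer": 0.3, "Lyon": 0.15, "Paris": 0.05}


def toy_expert(prefix):
    # Stand-in for the same model with the injected fact in context:
    # probability mass shifts toward the fact token "Paris".
    return {"the": 0.3, "answer": 0.2, "Lyon": 0.05, "Paris": 0.45}


rng = random.Random(0)
mixed = mix_supervision(toy_expert, toy_naive, 2000, 0.5, rng)
expert_only = mix_supervision(toy_expert, toy_naive, 2000, 1.0, rng)
print("NLL under naive, mixed targets:      ", nll_under(toy_naive, mixed))
print("NLL under naive, expert-only targets:", nll_under(toy_naive, expert_only))
```

In a real setting both conditionals would come from the same language model evaluated with and without the fact in its context, and the mixing schedule would follow whatever rule the full paper defines.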