MixSD: Knowledge Injection via Mixed-Context Self-Distillation

Abstract

Supervised fine-tuning (SFT) is widely used to inject new knowledge into language models, but it often degrades pretrained capabilities such as reasoning and general-domain performance. We argue that this forgetting arises because fine-tuning targets written by humans or produced by external systems diverge from the model's own autoregressive distribution, forcing the optimizer to imitate low-probability token sequences. To address this problem, we propose MixSD, a simple method for distribution-aligned knowledge injection that requires no external teacher. Instead of training on fixed targets, MixSD constructs supervision dynamically by mixing tokens from two conditional distributions of the base model itself: an expert conditional that observes the injected fact in context, and a naive conditional that reflects the model's original prior. The resulting supervision sequences preserve the factual learning signal while remaining substantially closer to the base model's distribution. We evaluate MixSD on two synthetic corpora that we construct to study factual recall and arithmetic function acquisition in a controlled setting, together with established benchmarks for open-domain factual question answering and knowledge editing. Across multiple model scales and settings, MixSD consistently achieves a better memorization-retention trade-off than SFT and on-policy self-distillation baselines: it retains up to 100% of the base model's held-out capability while maintaining near-perfect training accuracy, whereas standard SFT retains as little as 1%. We further show that MixSD produces supervision targets with substantially lower negative log-likelihood (NLL) under the base model and reduces harmful movement along Fisher-sensitive parameter directions. These results suggest that aligning supervision with the model's native generation distribution is a simple and effective principle for knowledge injection that mitigates catastrophic forgetting.
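To make the idea of mixing tokens from the expert and naive conditionals concrete, the following is a minimal toy sketch, not the authors' implementation. The abstract does not state the mixing rule, so the sketch assumes a per-token Bernoulli choice between the two conditionals with a hypothetical rate `mix_rate`. The lookup tables `EXPERT_TABLE` and `NAIVE_TABLE` stand in for a real base model with and without the injected fact in context, and every name in the block is illustrative. The helper `nll` scores a target sequence under the naive conditional, which mirrors the abstract's comparison of supervision targets by their NLL under the base model.

```python
import math
import random
from typing import Callable, Dict, List, Sequence

Dist = Dict[str, float]
# A conditional maps the tokens generated so far to a next-token distribution.
Conditional = Callable[[Sequence[str]], Dist]


def table_conditional(table: List[Dist]) -> Conditional:
    """Toy stand-in for a model: the distribution depends only on position."""
    return lambda prefix: table[min(len(prefix), len(table) - 1)]


# Hypothetical toy distributions. The expert sees the injected fact ("Quux")
# in context; the naive conditional reflects the prior without that fact.
EXPERT_TABLE = [{"the": 1.0}, {"answer": 1.0}, {"is": 1.0},
                {"Quux": 0.99, "unknown": 0.01}, {"<eos>": 1.0}]
NAIVE_TABLE = [{"the": 1.0}, {"answer": 1.0}, {"is": 1.0},
               {"unknown": 0.98, "Quux": 0.02}, {"<eos>": 1.0}]
expert = table_conditional(EXPERT_TABLE)
naive = table_conditional(NAIVE_TABLE)


def sample(dist: Dist, rng: random.Random) -> str:
    """Draw one token from a next-token distribution."""
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


def mix_supervision(expert: Conditional, naive: Conditional, mix_rate: float,
                    max_len: int, rng: random.Random,
                    eos: str = "<eos>") -> List[str]:
    """Build one supervision sequence token by token.

    ASSUMPTION: each token is drawn from the expert conditional with
    probability mix_rate and from the naive conditional otherwise. Both
    conditionals see the same shared prefix of already-chosen tokens.
    """
    prefix: List[str] = []
    for _ in range(max_len):
        source = expert if rng.random() < mix_rate else naive
        token = sample(source(prefix), rng)
        prefix.append(token)
        if token == eos:
            break
    return prefix


def nll(tokens: Sequence[str], cond: Conditional, floor: float = 1e-9) -> float:
    """Negative log-likelihood of a token sequence under a conditional."""
    total = 0.0
    for i, tok in enumerate(tokens):
        p = cond(tokens[:i]).get(tok, 0.0)
        total -= math.log(max(p, floor))  # floor avoids log(0)
    return total


rng = random.Random(0)
target = mix_supervision(expert, naive, mix_rate=0.5, max_len=8, rng=rng)
print(target, round(nll(target, naive), 3))
```

In this toy setup, a target containing the injected token `Quux` has a higher NLL under the naive conditional than one containing `unknown`, which illustrates why targets that stay close to the base distribution are cheaper for the optimizer to imitate. A real implementation would replace the lookup tables with two forward passes of the same base model, one with the fact in the prompt and one without.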