MixSD: Knowledge Injection via Mixed Contextual Self-Distillation

Abstract

Supervised fine-tuning (SFT) is widely used to inject new knowledge into language models, but it often degrades pretrained capabilities such as reasoning and general-domain performance. We argue that this forgetting arises because fine-tuning targets written by humans or produced by external systems diverge from the model's own autoregressive distribution, forcing the optimizer to imitate low-probability token sequences. To address this problem, we propose MixSD, a simple method for distribution-aligned knowledge injection that requires no external teacher. Instead of training on fixed targets, MixSD constructs supervision dynamically by mixing tokens from two conditionals of the base model itself: an expert conditional that observes the injected fact in context, and a naive conditional that reflects the model's original prior. The resulting supervision sequences preserve the factual learning signal while remaining substantially closer to the base model's distribution. We evaluate MixSD on two synthetic corpora that we construct to study factual recall and arithmetic function acquisition in a controlled setting, together with established benchmarks for open-domain factual question answering and knowledge editing. Across multiple model scales and settings, MixSD consistently achieves a better memorization-retention trade-off than SFT and on-policy self-distillation baselines, retaining up to 100% of the base model's held-out capability while maintaining near-perfect training accuracy, whereas standard SFT retains as little as 1%. We further show that MixSD produces supervision targets with substantially lower negative log-likelihood (NLL) under the base model and reduces harmful movement along Fisher-sensitive parameter directions. These results suggest that aligning supervision with the model's native generation distribution is a simple and effective principle for knowledge injection that mitigates catastrophic forgetting.
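
The token-mixing idea described above can be illustrated with a minimal toy sketch. Everything in it is an assumption made for illustration, not the method as specified in the paper: the per-position distributions `NAIVE` and `EXPERT` stand in for the base model's two conditionals, the per-token Bernoulli rule with a hypothetical mixing rate `alpha` is one possible way to "mix tokens", and greedy selection is used only to keep the example deterministic.

```python
import math
import random

# Hypothetical toy stand-in for a base model: one next-token distribution
# per position, without (NAIVE) and with (EXPERT) the injected fact in context.
NAIVE = [
    {"It": 0.7, "The": 0.3},
    {"is": 0.9, "was": 0.1},
    {"Paris": 0.9, "Lutz": 0.1},
]
EXPERT = [
    {"It": 0.6, "The": 0.4},
    {"is": 0.9, "was": 0.1},
    {"Lutz": 0.95, "Paris": 0.05},
]


def greedy(dist):
    """Return the most probable token of a distribution."""
    return max(dist, key=dist.get)


def mix_supervision(alpha, rng):
    """Build a supervision sequence token by token.

    Assumed mixing rule: at each position, take the expert conditional's
    token with probability alpha, otherwise the naive conditional's token.
    """
    tokens = []
    for naive_dist, expert_dist in zip(NAIVE, EXPERT):
        source = expert_dist if rng.random() < alpha else naive_dist
        tokens.append(greedy(source))
    return tokens


def nll_under_naive(tokens, floor=1e-6):
    """Negative log-likelihood of a sequence under the naive conditional.

    Tokens the toy model never produces get a small floor probability.
    """
    return -sum(
        math.log(NAIVE[i].get(tok, floor)) for i, tok in enumerate(tokens)
    )


if __name__ == "__main__":
    rng = random.Random(0)
    mixed = mix_supervision(alpha=0.5, rng=rng)
    external = ["Capital", ":", "Lutz"]  # hypothetical fixed external target
    print("mixed target:", mixed, "NLL:", round(nll_under_naive(mixed), 3))
    print("external target:", external, "NLL:", round(nll_under_naive(external), 3))
```

In this toy setting, any sequence drawn from the model's own conditionals scores a far lower NLL under the naive conditional than the externally phrased target does, even when both end in the same injected fact. That is the intuition behind supervising on model-native sequences rather than fixed targets.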