

Self-Distillation Enables Continual Learning

January 27, 2026
Authors: Idan Shenfeld, Mehul Damani, Jonas Hübotter, Pulkit Agrawal
cs.AI

Abstract

Continual learning, enabling models to acquire new skills and knowledge without degrading existing capabilities, remains a fundamental challenge for foundation models. While on-policy reinforcement learning can reduce forgetting, it requires explicit reward functions that are often unavailable. Learning from expert demonstrations, the primary alternative, is dominated by supervised fine-tuning (SFT), which is inherently off-policy. We introduce Self-Distillation Fine-Tuning (SDFT), a simple method that enables on-policy learning directly from demonstrations. SDFT leverages in-context learning by using a demonstration-conditioned model as its own teacher, generating on-policy training signals that preserve prior capabilities while acquiring new skills. Across skill learning and knowledge acquisition tasks, SDFT consistently outperforms SFT, achieving higher new-task accuracy while substantially reducing catastrophic forgetting. In sequential learning experiments, SDFT enables a single model to accumulate multiple skills over time without performance regression, establishing on-policy distillation as a practical path to continual learning from demonstrations.
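The abstract describes the SDFT mechanism only at a high level: the model conditioned on a demonstration acts as its own teacher, and its outputs supervise the same model on its own (on-policy) samples. The sketch below illustrates that idea, assuming a Hugging Face causal LM and a token-level KL distillation objective; the model name, the prompt/demonstration formatting, and the exact loss are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of on-policy self-distillation from a demonstration.
# Assumptions (not from the paper): GPT-2 as the base model, demonstration
# prepended as plain text, forward KL over response tokens as the loss.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works for the sketch
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)


def sdft_step(prompt: str, demonstration: str) -> float:
    # 1) Sample an on-policy response from the current policy, given the prompt only.
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model.generate(
            prompt_ids, do_sample=True, max_new_tokens=64,
            pad_token_id=tok.eos_token_id,
        )
    response_ids = out[:, prompt_ids.shape[1]:]

    # 2) Teacher pass: the same model, conditioned on the demonstration in-context,
    #    scores the on-policy response (no gradients).
    teacher_ctx = tok(demonstration + "\n" + prompt, return_tensors="pt").input_ids
    teacher_input = torch.cat([teacher_ctx, response_ids], dim=1)
    with torch.no_grad():
        teacher_logits = model(teacher_input).logits[:, teacher_ctx.shape[1] - 1:-1]

    # 3) Student pass: the same model without the demonstration, with gradients.
    student_input = torch.cat([prompt_ids, response_ids], dim=1)
    student_logits = model(student_input).logits[:, prompt_ids.shape[1] - 1:-1]

    # 4) Distill: push the unconditioned policy toward its demonstration-conditioned
    #    self on its own samples (KL(teacher || student) over response tokens).
    loss = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.log_softmax(teacher_logits, dim=-1),
        log_target=True, reduction="batchmean",
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the training targets come from the model's own samples rather than from the demonstration tokens directly, the update stays close to the current policy, which is the property the abstract credits for reduced catastrophic forgetting.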