

Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes

March 26, 2026
Authors: Yuqian Fu, Haohuan Huang, Kaiwen Jiang, Yuanheng Zhu, Dongbin Zhao
cs.AI

Abstract

On-policy distillation (OPD) is appealing for large language model (LLM) post-training because it evaluates teacher feedback on student-generated rollouts rather than fixed teacher traces. In long-horizon settings, however, the common sampled-token variant is fragile: it reduces distribution matching to a one-token signal and becomes increasingly unreliable as rollouts drift away from prefixes the teacher commonly visits. We revisit OPD from the estimator and implementation sides. Theoretically, token-level OPD is biased relative to sequence-level reverse-KL, but it has a much tighter worst-case variance bound; our toy study shows the same tradeoff empirically, with stronger future-reward coupling producing higher gradient variance and less stable learning. Empirically, we identify three failure modes of sampled-token OPD: an imbalanced one-token signal, unreliable teacher guidance on student-generated prefixes, and distortions caused by tokenizer or special-token mismatch. We address these issues with teacher top-K local support matching, implemented as truncated reverse-KL with top-p rollout sampling and special-token masking. Across single-task math reasoning and multi-task agentic-plus-math training, this objective yields more stable optimization and better downstream performance than sampled-token OPD.
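The abstract's proposed objective, truncated reverse-KL over the teacher's top-K local support with special-token masking, can be illustrated with a minimal sketch. All names here (`topk_truncated_reverse_kl`, the renormalization choice, and the boolean mask convention) are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def topk_truncated_reverse_kl(student_logits, teacher_logits, k=4, mask=None):
    """Sketch of teacher top-K local support matching via truncated reverse KL.

    student_logits, teacher_logits: (T, V) per-position logits over the vocab.
    mask: optional (T,) boolean; False positions (e.g. mismatched special
    tokens) are excluded from the loss, mimicking special-token masking.
    Returns the mean per-position reverse KL sum_v q(v) * log(q(v) / p(v)),
    with student q and teacher p each renormalized on the teacher's top-K set.
    """
    T, V = student_logits.shape
    losses = []
    for t in range(T):
        if mask is not None and not mask[t]:
            continue  # special-token masking: skip this position entirely
        topk = np.argsort(teacher_logits[t])[-k:]  # teacher's local support
        q = np.exp(student_logits[t, topk])
        p = np.exp(teacher_logits[t, topk])
        q, p = q / q.sum(), p / p.sum()  # renormalize on the truncated support
        losses.append(np.sum(q * np.log(q / p)))  # reverse KL: student-weighted
    return float(np.mean(losses))
```

Restricting the match to the teacher's top-K tokens sidesteps the one-token sampled signal the abstract criticizes, while keeping the reverse-KL (mode-seeking) direction; a full implementation would additionally draw rollouts with top-p sampling.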
PDF · March 28, 2026