
Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes

March 26, 2026
Authors: Yuqian Fu, Haohuan Huang, Kaiwen Jiang, Yuanheng Zhu, Dongbin Zhao
cs.AI

Abstract

On-policy distillation (OPD) is appealing for large language model (LLM) post-training because it evaluates teacher feedback on student-generated rollouts rather than fixed teacher traces. In long-horizon settings, however, the common sampled-token variant is fragile: it reduces distribution matching to a one-token signal and becomes increasingly unreliable as rollouts drift away from prefixes the teacher commonly visits. We revisit OPD from the estimator and implementation sides. Theoretically, token-level OPD is biased relative to sequence-level reverse-KL, but it has a much tighter worst-case variance bound; our toy study shows the same tradeoff empirically, with stronger future-reward coupling producing higher gradient variance and less stable learning. Empirically, we identify three failure modes of sampled-token OPD: an imbalanced one-token signal, unreliable teacher guidance on student-generated prefixes, and distortions caused by tokenizer or special-token mismatch. We address these issues with teacher top-K local support matching, implemented as truncated reverse-KL with top-p rollout sampling and special-token masking. Across single-task math reasoning and multi-task agentic-plus-math training, this objective yields more stable optimization and better downstream performance than sampled-token OPD.
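The abstract's proposed objective is teacher top-K local support matching: a reverse KL between student and teacher restricted to the teacher's top-K tokens at each step. A minimal NumPy sketch of that kind of objective is below; the function name, the choice of K, and the decision to renormalize both distributions on the truncated support are illustrative assumptions, not the paper's exact implementation (which also involves top-p rollout sampling and special-token masking).

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def topk_truncated_reverse_kl(student_logits, teacher_logits, k=4):
    """Reverse KL(student || teacher) restricted to the teacher's
    top-k token support, with both distributions renormalized on that
    support (an illustrative truncation choice). Inputs are
    (..., vocab) logit arrays; returns the mean per-position loss."""
    # Indices of the teacher's k highest-logit tokens per position.
    idx = np.argsort(teacher_logits, axis=-1)[..., ::-1][..., :k]
    # Softmax over the selected logits == probabilities renormalized
    # on the truncated support.
    s = softmax(np.take_along_axis(student_logits, idx, axis=-1))
    t = softmax(np.take_along_axis(teacher_logits, idx, axis=-1))
    # Reverse KL: expectation under the *student* distribution, which
    # is what makes the objective on-policy / mode-seeking.
    return float((s * (np.log(s) - np.log(t))).sum(axis=-1).mean())
```

Because the sum runs over the teacher's top-K support rather than a single sampled token, every position contributes a dense K-way signal, which is the mechanism the abstract credits for avoiding the imbalanced one-token signal of sampled-token OPD.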
March 28, 2026