
The Many Faces of On-Policy Distillation: Pitfalls, Mechanisms, and Fixes

May 11, 2026
作者: Siqi Zhu, Xuyan Ye, Hongyu Lu, Weiye Shi, Ge Liu
cs.AI

Abstract

On-policy distillation (OPD) and on-policy self-distillation (OPSD) have emerged as promising post-training methods for large language models, offering dense token-level supervision on trajectories sampled from the model's own policy. However, existing results on their effectiveness remain mixed: while OP(S)D has shown promise in system prompt and knowledge internalization, recent studies also report instability and degradation. In this work, we present a comprehensive empirical study of when OPD and OPSD work, when they fail, and why. We find that OPD on mathematical reasoning is highly sensitive to teacher choice and loss formulation, whereas OPSD fails in our tested settings due to test-time absence of instance-specific privileged information (PI). In contrast, OPSD is effective when PI represents a shared latent rule, such as a system prompt or alignment preference. We identify three failure mechanisms: (1) distribution mismatch between teacher and student caused by conditioning on student-generated prefixes, (2) optimization instability from biased TopK reverse-KL gradients, and (3) an OPSD-specific limitation where the student learns a PI-free policy that aggregates PI-conditioned teachers, which is insufficient when PI is instance-specific. We further show that stop-gradient TopK objectives, RLVR-adapted teachers, and SFT-stabilized students mitigate these failures.
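To make the TopK reverse-KL bias mentioned above concrete, the following is a minimal numerical sketch (not the paper's implementation; the function names and the renormalize-on-top-k truncation choice are assumptions for illustration). Restricting the reverse KL to the student's top-k tokens and renormalizing on that support generally gives a value, and hence a gradient, that differs from the full reverse KL:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reverse_kl(p_student, p_teacher):
    # Full reverse KL: KL(student || teacher) = sum_v p_s(v) * log(p_s(v) / p_t(v))
    return float(np.sum(p_student * np.log(p_student / p_teacher)))

def topk_reverse_kl(p_student, p_teacher, k):
    # One common truncation: keep only the student's top-k tokens and
    # renormalize both distributions on that support before taking the KL.
    idx = np.argsort(p_student)[-k:]
    ps = p_student[idx] / p_student[idx].sum()
    pt = p_teacher[idx] / p_teacher[idx].sum()
    return float(np.sum(ps * np.log(ps / pt)))

rng = np.random.default_rng(0)
V = 50  # toy vocabulary size
p_s = softmax(rng.normal(size=V))  # stand-in student next-token distribution
p_t = softmax(rng.normal(size=V))  # stand-in teacher next-token distribution

full = reverse_kl(p_s, p_t)
trunc = topk_reverse_kl(p_s, p_t, k=8)
print(full, trunc)
```

The truncated estimate is a biased proxy for the full divergence; the abstract's proposed fix applies a stop-gradient inside the TopK objective so that this bias does not feed back into the optimization.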