
Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

April 14, 2026
Authors: Yaxuan Li, Yuxin Zuo, Bingxiang He, Jinqian Zhang, Chaojun Xiao, Cheng Qian, Tianyu Yu, Huan-ang Gao, Wenkai Yang, Zhiyuan Liu, Ning Ding
cs.AI

Abstract

On-policy distillation (OPD) has become a core technique in the post-training of large language models, yet its training dynamics remain poorly understood. This paper provides a systematic investigation of OPD dynamics and mechanisms. We first identify that two conditions govern whether OPD succeeds or fails: (i) the student and teacher should share compatible thinking patterns; and (ii) even with consistent thinking patterns and higher scores, the teacher must offer genuinely new capabilities beyond what the student has seen during training. We validate these findings through weak-to-strong reverse distillation, showing that same-family 1.5B and 7B teachers are distributionally indistinguishable from the student's perspective. Probing into the token-level mechanism, we show that successful OPD is characterized by progressive alignment on high-probability tokens at student-visited states, a small shared token set that concentrates most of the probability mass (97%-99%). We further propose two practical strategies to recover failing OPD: off-policy cold start and teacher-aligned prompt selection. Finally, we show that OPD's apparent free lunch of dense token-level reward comes at a cost, raising the question of whether OPD can scale to long-horizon distillation.
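
The abstract characterizes successful OPD as progressive alignment, at student-visited states, on a small set of high-probability tokens that carries 97%-99% of the probability mass. Below is a minimal PyTorch sketch of the generic OPD objective (a per-token reverse KL between student and teacher, evaluated on student-generated rollouts) together with a top-k proxy for measuring that shared mass. The function names and the top-k construction are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def opd_reverse_kl(student_logits: torch.Tensor,
                   teacher_logits: torch.Tensor) -> torch.Tensor:
    """Per-token reverse KL(student || teacher), averaged over all
    student-visited states. Both tensors: (batch, seq_len, vocab)."""
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    # KL(p_s || p_t) = sum_v p_s(v) * (log p_s(v) - log p_t(v))
    per_token_kl = (s_logp.exp() * (s_logp - t_logp)).sum(dim=-1)
    return per_token_kl.mean()

def shared_topk_mass(student_logits: torch.Tensor,
                     teacher_logits: torch.Tensor,
                     k: int = 20):
    """Probability mass that student and teacher each place on the
    student's own top-k tokens -- a rough proxy (an assumption here)
    for the small shared high-probability set the paper describes."""
    s_prob = student_logits.softmax(dim=-1)
    t_prob = teacher_logits.softmax(dim=-1)
    topk_idx = s_prob.topk(k, dim=-1).indices
    s_mass = s_prob.gather(-1, topk_idx).sum(dim=-1).mean()
    t_mass = t_prob.gather(-1, topk_idx).sum(dim=-1).mean()
    return s_mass, t_mass

if __name__ == "__main__":
    # Toy shapes only; in practice the logits come from scoring
    # student-sampled rollouts with both models.
    s = torch.randn(2, 8, 512)
    t = torch.randn(2, 8, 512)
    print(opd_reverse_kl(s, t).item())
    print([m.item() for m in shared_topk_mass(s, t)])
```

Under this formulation, training signal is dense (one KL term per generated token) but is only informative where the two distributions can actually be pulled together, which is consistent with the abstract's finding that alignment concentrates on the small shared high-probability token set.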