Less Is More: Early Stopping Rollout for On-Policy Distillation

Abstract

On-policy distillation (OPD) has recently emerged as a promising alternative to standard sequence-level imitation: it trains a student by having a teacher model score the student's own rollouts. However, we observe an "Off-policy Teacher Decay" problem in this paradigm: for later tokens, the student's earlier trajectory serves as context that is off-policy for the teacher, so the teacher's ability to produce a corrective score decays, and it may fall back to the token-completion behavior learned during pre-training. We verify this problem empirically and propose Early Stopping Rollout (ESR) to fix it: a simple yet effective distillation strategy that restricts rollout generation to the first few response tokens. We show that ESR surpasses full-rollout OPD across model sizes, families, tasks, and training regimes, and that it exhibits much higher GPU efficiency and training stability, especially in cross-model-family scenarios. We further investigate the mechanism behind this surprising performance and identify the "Cascading Alignment" and "Sub-mode Commitment" effects of ESR, which may explain why it works and why it sometimes even exceeds the teacher model's performance. In addition, we show that this position-based token selection strategy cannot be fully explained by KL divergence and entropy signals.
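To make the idea concrete, here is a minimal sketch of a token-level distillation loss restricted to the first k response positions. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not specify the loss, so per-token reverse KL (student against teacher) is assumed here, and the names `esr_loss`, `reverse_kl`, and the parameter `k` are hypothetical. In actual ESR the rollout would stop after k tokens, so later positions are never generated; the truncation below only emulates that on precomputed logits.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at a single token position (assumed loss, see lead-in)."""
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(s, t))


def esr_loss(student_logits, teacher_logits, k):
    """Mean per-token reverse KL over the first k response positions only.

    student_logits / teacher_logits: one list of logits per response position,
    both computed on the student's own rollout. Positions at index >= k are
    ignored, mimicking a rollout that was stopped after k tokens.
    """
    n = min(k, len(student_logits), len(teacher_logits))
    if n == 0:
        return 0.0
    total = sum(reverse_kl(student_logits[i], teacher_logits[i]) for i in range(n))
    return total / n
```

With `k` set to the full response length, this reduces to an ordinary full-rollout per-token loss, so the two regimes can be compared by changing a single argument.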