Less Is More: Early Stopping Rollout in On-Policy Distillation

Abstract

On-policy distillation (OPD) has recently emerged as a promising alternative to standard sequence-level imitation: it trains a student by having a teacher model score the student's own rollouts. However, we observe an "Off-policy Teacher Decay" problem in this paradigm. For later tokens, the student's earlier trajectory serves as context that is off-policy with respect to the teacher, so the teacher's ability to produce a corrective score decays, and the teacher may fall back to the token-completion behavior learned during pre-training. We empirically verify this problem and propose Early Stopping Rollout (ESR) to address it: a simple yet effective distillation strategy that restricts rollout generation to the first few response tokens. We show that ESR surpasses full-rollout OPD across model sizes, model families, tasks, and training regimes, and that it exhibits much higher GPU efficiency and training stability, especially in cross-model-family scenarios. We further investigate the mechanism behind this surprising performance and identify the "Cascading Alignment" and "Sub-mode Commitment" effects of ESR, which may explain why it works effectively and why it sometimes even exceeds the performance of the teacher model. In addition, we show that this position-based token selection strategy cannot be fully explained by KL divergence and entropy signals.