Trajectory-Refined Distillation

Abstract

On-policy distillation (OPD) has become a central post-training tool for large language models (LLMs), providing dense per-token teacher supervision along the student's own rollouts. In this work, we identify a common structural problem underlying OPD, which we call prefix failure. Under prefix failure, dense per-token supervision induces a bimodal teacher mixture and fragmented gradients, which token-level loss truncation or reweighting cannot fix. This observation motivates us to move beyond token-level loss interventions toward trajectory-level output corrections. We therefore propose Trajectory-Refined Distillation (TRD), a trajectory-level correction method that revises the student's rollout under teacher guidance while staying within on-policy support. By correcting problematic prefixes before distillation, TRD mitigates prefix failure at its source. Moreover, TRD improves exploration by exposing the student to alternative valid derivations under teacher guidance, even when the original rollouts are already correct. TRD also applies to on-policy self-distillation (OPSD), a parameter-sharing variant that uses the student model conditioned on privileged information as the teacher. Across a wide range of benchmarks and base models at multiple scales, TRD consistently outperforms prior baselines, improving single-attempt accuracy and broadening reasoning coverage. Code is available at https://github.com/louieworth/trd
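
To make "dense per-token teacher supervision" concrete, the following is a minimal sketch of a per-position distillation signal computed along a student rollout. It assumes the divergence is the reverse KL, KL(student || teacher), which is one common choice; the abstract does not state which divergence is used. The function names and the list-of-logit-vectors interface are hypothetical and are not taken from the TRD repository.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def per_token_reverse_kl(student_logits, teacher_logits):
    """Return KL(student || teacher) at every position of one rollout.

    Assumption: both inputs are lists of per-position logit vectors,
    obtained by scoring the same student-generated prefix with the
    student and with the teacher.
    """
    losses = []
    for s_row, t_row in zip(student_logits, teacher_logits):
        s_logp = log_softmax(s_row)
        t_logp = log_softmax(t_row)
        # Expectation under the student distribution of log(p_s / p_t).
        kl = sum(math.exp(ls) * (ls - lt) for ls, lt in zip(s_logp, t_logp))
        losses.append(kl)
    return losses


if __name__ == "__main__":
    # Toy rollout: two positions, vocabulary of three tokens.
    student = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    teacher = [[0.0, 2.0, 0.0], [0.0, 0.0, 0.0]]
    for position, loss in enumerate(per_token_reverse_kl(student, teacher)):
        print(f"position {position}: reverse KL = {loss:.4f}")
```

In this toy case the first position, where student and teacher disagree, receives a positive loss, while the second, where they agree, receives zero. The sketch only illustrates the token-level signal that standard OPD optimizes; it does not implement the trajectory-level revision that TRD performs before distillation.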