Trajectory-Refined Distillation

Abstract

On-policy distillation (OPD) has become a central post-training tool for large language models (LLMs), providing dense per-token teacher supervision along the student's own rollouts. In this work, we identify a common structural cause underlying OPD's failure modes, which we call prefix failure. Under prefix failure, dense per-token supervision induces a bimodal teacher mixture and fragmented gradients, which neither token-level loss truncation nor reweighting resolves. This observation motivates a move beyond token-level loss interventions toward trajectory-level output corrections. We therefore propose Trajectory-Refined Distillation (TRD), a trajectory-level correction method that revises the student's rollout under teacher guidance while staying within on-policy support. By correcting problematic prefixes before distillation, TRD mitigates prefix failure at its source. Moreover, TRD improves exploration by exposing the student to alternative valid derivations under teacher guidance, even when the original rollouts are already correct. TRD can also be applied to on-policy self-distillation (OPSD), a parameter-sharing variant that uses the student model conditioned on privileged information as the teacher. Across multiple benchmarks and base models at several scales, TRD consistently outperforms prior baselines, improving single-attempt accuracy and broadening reasoning coverage. Code is available at https://github.com/louieworth/trd
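
To make "dense per-token teacher supervision" concrete, the sketch below computes one loss value per position of a student rollout. It assumes a per-token reverse KL divergence between the student and teacher next-token distributions, which is a common choice for OPD but is not stated in the abstract. The function names and the toy logits are hypothetical and are not taken from the TRD code base.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) for one token position (assumed loss form)."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(student_probs, teacher_probs)
    )


def per_token_opd_losses(student_logits, teacher_logits):
    """Return one loss per position of a student-generated rollout.

    Both arguments are lists of per-position logit vectors, evaluated on
    the same student-sampled prefix (hypothetical toy interface).
    """
    return [
        reverse_kl(softmax(s), softmax(t))
        for s, t in zip(student_logits, teacher_logits)
    ]


# Toy rollout of two positions over a three-token vocabulary.
# Position 0: teacher agrees with the student. Position 1: it does not.
student_logits = [[2.0, 0.5, 0.1], [0.3, 1.8, 0.2]]
teacher_logits = [[2.0, 0.5, 0.1], [1.9, 0.1, 0.3]]

losses = per_token_opd_losses(student_logits, teacher_logits)
for position, loss in enumerate(losses):
    print(f"position {position}: reverse KL = {loss:.4f}")
```

Because every position yields its own loss term, the teacher's signal at each token depends on the student-sampled prefix that precedes it. This is the setting in which the abstract locates prefix failure, although the sketch itself does not reproduce that effect.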