Learning to Foresee: Unveiling the Efficiency Unlock of On-Policy Distillation

Abstract

On-policy distillation (OPD) has emerged as an efficient post-training paradigm for large language models. Existing studies largely attribute this advantage to denser and more stable supervision signals, while the parameter-level mechanisms underlying OPD's efficiency remain poorly understood. In this work, we argue that OPD's efficiency stems from a form of "foresight": early in training, it establishes a stable update trajectory toward the final model. This foresight manifests at two levels. First, at the module-allocation level, OPD identifies regions with low marginal utility and concentrates updates on the modules that are more critical to reasoning. Second, at the update-direction level, OPD exhibits stronger low-rank concentration, and its dominant subspaces align closely with the final update subspace early in training. Building on these findings, we propose EffOPD, a plug-and-play acceleration method that speeds up OPD by adaptively selecting an extrapolation step size and moving along the current update direction. EffOPD requires no additional trainable modules or complex hyperparameter tuning, and achieves an average 3× training speedup while maintaining comparable final performance. Overall, our findings provide a parameter-dynamics perspective on the efficiency of OPD and offer practical insights for designing more efficient post-training methods for large language models.
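
The two ideas named above, low-rank alignment of the update and extrapolation along the current update direction, can be illustrated with a minimal NumPy sketch. This is not the EffOPD implementation: the abstract does not say how the step size is chosen, so the sketch assumes a hypothetical `score_fn` (for example, a held-out validation score) and a fixed list of candidate step sizes. All function names here are illustrative assumptions.

```python
import numpy as np


def energy_in_top_k(delta, k):
    """Fraction of squared singular-value mass in the top-k directions.

    Values near 1 mean the update matrix is concentrated in a low-rank subspace.
    """
    s = np.linalg.svd(delta, compute_uv=False)
    return float(np.sum(s[:k] ** 2) / np.sum(s ** 2))


def subspace_overlap(delta_early, delta_final, k):
    """Overlap in [0, 1] between the top-k left singular subspaces.

    1 means the early update already spans the same subspace as the final one.
    """
    u_early = np.linalg.svd(delta_early, full_matrices=False)[0][:, :k]
    u_final = np.linalg.svd(delta_final, full_matrices=False)[0][:, :k]
    return float(np.linalg.norm(u_early.T @ u_final, "fro") ** 2 / k)


def extrapolate(theta_prev, theta_curr, candidates, score_fn):
    """Move along (theta_curr - theta_prev) by the best-scoring step size.

    ASSUMPTION: step sizes are picked by a user-supplied score_fn (higher is
    better). Returns alpha = 0.0 and the unchanged weights if no candidate
    improves on the current score.
    """
    direction = theta_curr - theta_prev
    best_alpha, best_theta = 0.0, theta_curr
    best_score = score_fn(theta_curr)
    for alpha in candidates:
        theta = theta_curr + alpha * direction
        score = score_fn(theta)
        if score > best_score:
            best_alpha, best_theta, best_score = alpha, theta, score
    return best_alpha, best_theta
```

Falling back to a step size of zero when no candidate improves the score keeps the extrapolation from degrading the current checkpoint, which is one plausible way to read "adaptively selecting" the step size.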