Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe
April 14, 2026
Authors: Yaxuan Li, Yuxin Zuo, Bingxiang He, Jinqian Zhang, Chaojun Xiao, Cheng Qian, Tianyu Yu, Huan-ang Gao, Wenkai Yang, Zhiyuan Liu, Ning Ding
cs.AI
Abstract
On-policy distillation (OPD) has become a core technique in the post-training of large language models, yet its training dynamics remain poorly understood. This paper presents a systematic study of OPD's dynamics and mechanisms. We first identify two conditions that govern whether OPD succeeds or fails: (i) the student and teacher should share compatible thinking patterns; and (ii) even with consistent thinking patterns and higher scores, the teacher must offer genuinely new capabilities beyond what the student has already seen during training. We validate these findings through weak-to-strong reverse distillation, showing that same-family 1.5B and 7B teachers are distributionally indistinguishable from the student's perspective. Probing the token-level mechanism, we show that successful OPD is characterized by progressive alignment on high-probability tokens at student-visited states: a small shared token set that concentrates most of the probability mass (97%-99%). We further propose two practical strategies to recover failing OPD: off-policy cold start and teacher-aligned prompt selection. Finally, we show that OPD's apparent free lunch of dense token-level rewards comes at a cost, raising the question of whether OPD can scale to long-horizon distillation.
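
To make the abstract's central object concrete, here is a minimal sketch of one OPD step under a common formulation: the student samples its own trajectories (so every state is student-visited), the teacher re-scores those tokens, and the loss is a per-token reverse KL, which is the "dense token-level reward" the abstract refers to. The `student`/`teacher` objects, the HuggingFace-style `generate`/`logits` interface, and the reverse-KL choice are illustrative assumptions, not the paper's released training code.

```python
# Hypothetical sketch of one on-policy distillation (OPD) step with a
# per-token reverse-KL objective on student-sampled trajectories.
import torch
import torch.nn.functional as F

@torch.no_grad()
def sample_rollout(student, prompt_ids, max_new_tokens=256):
    """Student generates on-policy, so all visited states are its own."""
    return student.generate(prompt_ids, max_new_tokens=max_new_tokens,
                            do_sample=True)

def opd_step(student, teacher, prompt_ids, optimizer):
    rollout = sample_rollout(student, prompt_ids)       # [B, T]
    # Re-score the student's own tokens with both models (shifted for
    # next-token prediction).
    s_logits = student(rollout).logits[:, :-1]          # [B, T-1, V]
    with torch.no_grad():
        t_logits = teacher(rollout).logits[:, :-1]      # [B, T-1, V]

    s_logp = F.log_softmax(s_logits, dim=-1)
    t_logp = F.log_softmax(t_logits, dim=-1)

    # Reverse KL at student-visited states: KL(student || teacher).
    # Every generated position yields a learning signal -- the dense
    # token-level reward, with no scalar outcome reward needed.
    rkl = (s_logp.exp() * (s_logp - t_logp)).sum(-1)    # [B, T-1]

    # Mask out prompt positions so only generated tokens contribute.
    gen_mask = torch.zeros_like(rkl)
    gen_mask[:, prompt_ids.shape[1] - 1:] = 1.0
    loss = (rkl * gen_mask).sum() / gen_mask.sum()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Under this formulation, the alignment the abstract describes falls out naturally: the reverse KL is dominated by the few tokens on which the student places high probability, so training progressively pulls that small shared high-mass token set toward the teacher at every student-visited state.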