OPRD: On-Policy Representation Distillation

Abstract

On-policy distillation (OPD) supervises the student only in output space, by matching next-token probabilities. This output-only paradigm has two limitations: (1) over large vocabularies (e.g., Qwen's ~150k tokens), the sampling variance of Monte Carlo KL estimates persists throughout training; and (2) it treats the teacher as a black box, discarding all the intermediate hidden states that precede the LM head. We propose On-Policy Representation Distillation (OPRD), which lifts distillation into hidden-state space by aligning student and teacher representations at selected layers on the same rollouts, bypassing the LM head entirely. Theoretically, OPRD eliminates this sampling variance and provides richer per-layer structural information. Empirically, OPRD closes the student-teacher gap on AIME 2024/2025 and AIMO, whereas output-space OPD baselines plateau below teacher performance. OPRD also trains 1.44x faster and uses 54% less memory than top-k OPD. Code: https://github.com/ShenzhiYang2000/OPRD
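The contrast between the two kinds of supervision can be illustrated with a toy sketch. It is not the authors' implementation: the abstract does not state the alignment loss, so mean squared error is assumed here, along with equal student and teacher hidden sizes (no projection layer). The function names `exact_kl`, `sampled_kl`, and `layer_alignment_loss` are hypothetical. The sketch shows that a single-sample Monte Carlo KL estimate changes from draw to draw, while a hidden-state alignment loss on fixed rollouts is a deterministic function of the two models' activations.

```python
import math
import random


def exact_kl(p, q):
    """Exact KL(p || q), summed over the whole vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def sampled_kl(p, q, rng):
    """Single-sample Monte Carlo estimate of KL(p || q).

    Draws one token x ~ p and returns log p(x) - log q(x). The estimate is
    unbiased, but its value depends on which token happened to be drawn.
    """
    token = rng.choices(range(len(p)), weights=p)[0]
    return math.log(p[token] / q[token])


def layer_alignment_loss(student_states, teacher_states, layers):
    """Assumed alignment loss: MSE over the selected layers.

    Both arguments map a layer index to a list of per-token hidden vectors
    (lists of floats) computed on the same rollout. Equal hidden sizes are
    assumed; a real student would likely need a learned projection.
    """
    total, count = 0.0, 0
    for layer in layers:
        for s_vec, t_vec in zip(student_states[layer], teacher_states[layer]):
            for s, t in zip(s_vec, t_vec):
                total += (s - t) ** 2
                count += 1
    return total / count


if __name__ == "__main__":
    rng = random.Random(0)
    student_probs = [0.5, 0.5]   # toy two-token vocabulary
    teacher_probs = [0.9, 0.1]

    print("exact KL:       ", exact_kl(student_probs, teacher_probs))
    print("sampled KL x5:  ",
          [sampled_kl(student_probs, teacher_probs, rng) for _ in range(5)])

    # Toy hidden states: two selected layers, one token, hidden size 2.
    student = {0: [[1.0, 2.0]], 1: [[0.0, 0.0]]}
    teacher = {0: [[1.0, 0.0]], 1: [[0.0, 2.0]]}
    print("alignment loss: ", layer_alignment_loss(student, teacher, [0, 1]))
```

With a real vocabulary of roughly 150k tokens the sampled estimate has far more outcomes to vary over than in this two-token toy, which is the variance the abstract refers to. The alignment loss involves no sampling over the vocabulary at all.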