OPRD: On-Policy Representation Distillation

June 4, 2026
Authors: Shenzhi Yang, Guangcheng Zhu, Bowen Song, Haobo Wang, Mingxuan Xia, Xing Zheng, Yingfan Ma, Zhongqi Chen, Weiqiang Wang, Gang Chen
cs.AI

Abstract

On-policy distillation (OPD) supervises the student only in output space, by matching next-token probabilities. This output-only paradigm has two limitations: (1) the sampling variance of Monte Carlo KL estimates over large vocabularies (e.g., Qwen's ~150k tokens) persists throughout training, and (2) it treats the teacher as a black box, discarding the intermediate hidden states computed before the LM head. We propose On-Policy Representation Distillation (OPRD), which lifts distillation into hidden-state space by aligning student and teacher representations at selected layers on the same rollouts, bypassing the LM head entirely. Theoretically, OPRD eliminates sampling variance and provides richer per-layer structural information. Empirically, OPRD closes the student-teacher gap on AIME 2024/2025 and AIMO, whereas output-space OPD baselines plateau below the teacher's performance. OPRD also trains 1.44x faster and uses 54% less memory than top-k OPD. Code: https://github.com/ShenzhiYang2000/OPRD.
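
The contrast between the two training signals can be illustrated with a small, standard-library-only sketch. It is not the authors' implementation: the abstract does not state the alignment loss, the layer selection, or how differing hidden widths are handled, so the mean-squared-error loss, the equal-width vectors, and every function name below (`exact_kl`, `sampled_kl`, `hidden_alignment_loss`) are assumptions made purely for illustration. The sketch compares a one-sample Monte Carlo KL estimate, which fluctuates from draw to draw, with a hidden-state loss that is a deterministic function of a fixed rollout.

```python
import math
import random
import statistics


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def exact_kl(student_p, teacher_p):
    """Exact KL(student || teacher): a full sum over the vocabulary."""
    return sum(s * math.log(s / t) for s, t in zip(student_p, teacher_p))


def sampled_kl(student_p, teacher_p, rng):
    """One-sample Monte Carlo estimate: log s(x) - log t(x), with x ~ student."""
    token = rng.choices(range(len(student_p)), weights=student_p, k=1)[0]
    return math.log(student_p[token] / teacher_p[token])


def hidden_alignment_loss(student_layers, teacher_layers):
    """Mean squared error between paired hidden vectors, averaged over layers.

    Assumption: MSE on equal-width vectors stands in for whatever alignment
    loss OPRD actually uses; the abstract does not specify it.
    """
    per_layer = []
    for s_vec, t_vec in zip(student_layers, teacher_layers):
        squared = sum((s - t) ** 2 for s, t in zip(s_vec, t_vec))
        per_layer.append(squared / len(s_vec))
    return sum(per_layer) / len(per_layer)


rng = random.Random(0)

# Toy next-token distributions over a small stand-in vocabulary.
vocab_size = 1000
student_p = softmax([rng.gauss(0.0, 1.0) for _ in range(vocab_size)])
teacher_p = softmax([rng.gauss(0.0, 1.0) for _ in range(vocab_size)])

kl_exact = exact_kl(student_p, teacher_p)
kl_samples = [sampled_kl(student_p, teacher_p, rng) for _ in range(5000)]
kl_mean = statistics.fmean(kl_samples)
kl_variance = statistics.pvariance(kl_samples)

# Toy hidden states for one token position: 3 selected layers, width 8.
student_layers = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(3)]
teacher_layers = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(3)]
loss_first = hidden_alignment_loss(student_layers, teacher_layers)
loss_second = hidden_alignment_loss(student_layers, teacher_layers)

print(f"exact KL:                 {kl_exact:.4f}")
print(f"mean of sampled KL:       {kl_mean:.4f}")
print(f"variance of sampled KL:   {kl_variance:.4f}")
print(f"hidden-state loss, twice: {loss_first:.4f} {loss_second:.4f}")
```

In this toy setting the sampled estimator is unbiased but each individual draw is noisy, whereas the hidden-state loss needs no token sampling beyond the rollout itself and never touches a vocabulary-sized tensor. The rollouts are still sampled from the student in either case; the sketch only isolates the extra variance introduced by estimating the KL from sampled tokens.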