OPRD: On-Policy Representation Distillation
June 4, 2026
Authors: Shenzhi Yang, Guangcheng Zhu, Bowen Song, Haobo Wang, Mingxuan Xia, Xing Zheng, Yingfan Ma, Zhongqi Chen, Weiqiang Wang, Gang Chen
cs.AI
Abstract
On-policy distillation (OPD) supervises the student only in output space, by matching next-token probabilities. This output-only paradigm has two limitations: (1) the sampling variance of Monte Carlo KL estimates over large vocabularies (e.g., Qwen's ~150k tokens) persists throughout training, and (2) it treats the teacher as a black box, discarding the intermediate hidden states that precede the LM head. We propose On-Policy Representation Distillation (OPRD), which lifts distillation into hidden-state space by aligning student and teacher representations at selected layers on the same rollouts, bypassing the LM head entirely. Theoretically, OPRD eliminates sampling variance and provides richer per-layer structural information. Empirically, OPRD closes the student-teacher gap on AIME 2024/2025 and AIMO, while output-space OPD baselines plateau below the teacher. OPRD also trains 1.44x faster and uses 54% less memory than top-k OPD. Code: https://github.com/ShenzhiYang2000/OPRD.
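To make the idea of aligning hidden states at selected layers concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the alignment objective, so the mean-squared-error loss, the linear projection from student width to teacher width, and the names `layer_alignment_loss`, `representation_distill_loss`, `layer_pairs`, and `projections` are all assumptions made for illustration. Random arrays stand in for hidden states collected on a single student rollout.

```python
import numpy as np


def layer_alignment_loss(student_h, teacher_h, proj):
    """MSE between linearly projected student states and teacher states.

    student_h: (num_tokens, d_student)
    teacher_h: (num_tokens, d_teacher)
    proj:      (d_student, d_teacher), a hypothetical learned projection
               (assumed here because the two models may differ in width).
    """
    diff = student_h @ proj - teacher_h
    return float(np.mean(diff ** 2))


def representation_distill_loss(student_layers, teacher_layers, projections, layer_pairs):
    """Average the per-layer alignment loss over the selected layer pairs.

    layer_pairs: list of (student_layer_index, teacher_layer_index).
    projections: dict keyed by student layer index.
    """
    losses = [
        layer_alignment_loss(student_layers[s], teacher_layers[t], projections[s])
        for s, t in layer_pairs
    ]
    return sum(losses) / len(losses)


# Toy data standing in for hidden states gathered on one student rollout.
rng = np.random.default_rng(0)
num_tokens, d_student, d_teacher = 4, 8, 16
student_layers = [rng.normal(size=(num_tokens, d_student)) for _ in range(3)]
teacher_layers = [rng.normal(size=(num_tokens, d_teacher)) for _ in range(4)]

# Assumed layer selection: student layer 0 -> teacher layer 1, 2 -> 3.
layer_pairs = [(0, 1), (2, 3)]
projections = {s: rng.normal(size=(d_student, d_teacher)) for s, _ in layer_pairs}

loss = representation_distill_loss(student_layers, teacher_layers, projections, layer_pairs)
print(f"alignment loss: {loss:.4f}")
```

In this sketch the loss is a deterministic function of the hidden states on the rollout: no token is sampled from the vocabulary and no LM head is evaluated, which is the property the abstract attributes to working in hidden-state space.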