Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation
February 12, 2026
Authors: Wenkai Yang, Weijie Liu, Ruobing Xie, Kai Yang, Saiyong Yang, Yankai Lin
cs.AI
Abstract
On-policy distillation (OPD), which aligns the student with the teacher's logit distribution on student-generated trajectories, has demonstrated strong empirical gains in student performance and often outperforms both off-policy distillation and reinforcement learning (RL). In this work, we first show theoretically that OPD is a special case of dense KL-constrained RL in which the reward function and the KL regularization are always weighted equally and the reference model can be any model. Building on this, we propose the Generalized On-Policy Distillation (G-OPD) framework, which extends the standard OPD objective by introducing a flexible reference model and a reward scaling factor that controls the relative weight of the reward term against the KL regularization. Through comprehensive experiments on math reasoning and code generation tasks, we derive two novel insights: (1) Setting the reward scaling factor to a value greater than 1 (i.e., reward extrapolation), which we term ExOPD, consistently improves over standard OPD across a range of teacher-student size pairings. In particular, when merging knowledge from different domain experts (obtained by applying domain-specific RL to the same student model) back into the original student, ExOPD enables the student to surpass the teacher's performance boundary and outperform the domain teachers. (2) Building on ExOPD, we further find that in the strong-to-weak distillation setting (i.e., distilling a smaller student from a larger teacher), performing reward correction by using the teacher's pre-RL base model as the reference model yields a more accurate reward signal and further improves distillation performance. However, this choice assumes access to the teacher's pre-RL variant and incurs additional computational overhead. We hope our work offers new insights for future research on OPD.
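As a rough illustration of the relationship stated above (the notation here is ours, inferred from the abstract, and the paper's exact formulation may differ): let $\pi_\theta$ be the student, $\pi_T$ the teacher, $\pi_{\mathrm{ref}}$ an arbitrary reference model, $s_t$ the student-generated prefix at step $t$, and $\alpha$ the reward scaling factor. Token-level OPD minimizes $\mathbb{E}_{y \sim \pi_\theta}\sum_t \mathrm{KL}\big(\pi_\theta(\cdot \mid s_t)\,\|\,\pi_T(\cdot \mid s_t)\big)$, and adding and subtracting $\log \pi_{\mathrm{ref}}$ shows this is equivalent to maximizing a dense reward $r_t(a) = \log \frac{\pi_T(a \mid s_t)}{\pi_{\mathrm{ref}}(a \mid s_t)}$ under an equally weighted KL penalty toward $\pi_{\mathrm{ref}}$. A generalized objective in this spirit re-weights the reward term:

$$\mathcal{J}_{\mathrm{G\text{-}OPD}}(\theta) = \mathbb{E}_{y \sim \pi_\theta}\Big[\sum_t \alpha\, \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s_t)} \log \frac{\pi_T(a \mid s_t)}{\pi_{\mathrm{ref}}(a \mid s_t)} - \mathrm{KL}\big(\pi_\theta(\cdot \mid s_t)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid s_t)\big)\Big],$$

where $\alpha = 1$ recovers standard OPD for any choice of $\pi_{\mathrm{ref}}$, $\alpha > 1$ corresponds to the reward-extrapolation variant (ExOPD), and the reward-correction variant described in insight (2) would instantiate $\pi_{\mathrm{ref}}$ as the teacher's pre-RL base model.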