Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation

February 12, 2026
Authors: Wenkai Yang, Weijie Liu, Ruobing Xie, Kai Yang, Saiyong Yang, Yankai Lin
cs.AI

Abstract

On-policy distillation (OPD), which aligns the student with the teacher's logit distribution on student-generated trajectories, has demonstrated strong empirical gains in improving student performance and often outperforms off-policy distillation and reinforcement learning (RL) paradigms. In this work, we first show theoretically that OPD is a special case of dense KL-constrained RL in which the reward function and the KL regularization are always weighted equally and the reference model can be any model. We then propose the Generalized On-Policy Distillation (G-OPD) framework, which extends the standard OPD objective by introducing a flexible reference model and a reward scaling factor that controls the relative weight of the reward term against the KL regularization. Through comprehensive experiments on math reasoning and code generation tasks, we derive two novel insights: (1) Setting the reward scaling factor to be greater than 1 (i.e., reward extrapolation), which we term ExOPD, consistently improves over standard OPD across a range of teacher-student size pairings. In particular, when merging the knowledge of different domain experts, each obtained by applying domain-specific RL to the same student model, back into the original student, ExOPD enables the student to surpass the teacher's performance boundary and even outperform the domain teachers. (2) Building on ExOPD, we further find that in the strong-to-weak distillation setting (i.e., distilling a smaller student from a larger teacher), performing reward correction by choosing the teacher's pre-RL base model as the reference model yields a more accurate reward signal and further improves distillation performance. However, this choice assumes access to the teacher's pre-RL variant and incurs additional computational overhead. We hope our work offers new insights for future research on OPD.
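
To make the framing above concrete, the following is a minimal sketch, under assumed notation rather than the paper's exact formulation, of how a G-OPD-style objective can be written as dense KL-constrained RL on student rollouts; here $\pi_\theta$, $\pi_T$, and $\pi_{\mathrm{ref}}$ denote the student, teacher, and reference policies, $\alpha$ is the reward scaling factor, and $y \sim \pi_\theta$ are student-generated trajectories for a prompt $x$:

% Sketch only: the notation ($\pi_\theta$, $\pi_T$, $\pi_{\mathrm{ref}}$, $\alpha$, $x$, $y$)
% is assumed for illustration and is not taken from the paper.
\begin{align}
  r_t &= \log \pi_T(y_t \mid y_{<t}, x) - \log \pi_{\mathrm{ref}}(y_t \mid y_{<t}, x), \\
  J(\theta) &= \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
      \left[ \sum_{t} \left( \alpha\, r_t
      - \log \frac{\pi_\theta(y_t \mid y_{<t}, x)}{\pi_{\mathrm{ref}}(y_t \mid y_{<t}, x)} \right) \right].
\end{align}

With $\alpha = 1$ the reference terms cancel, and maximizing $J(\theta)$ reduces to minimizing the token-level reverse KL to the teacher on student-generated trajectories, i.e., standard OPD, for any choice of $\pi_{\mathrm{ref}}$. Under this reading, $\alpha > 1$ corresponds to reward extrapolation (ExOPD), and choosing $\pi_{\mathrm{ref}}$ to be the teacher's pre-RL base model corresponds to the reward-correction variant described above.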