Not All Disagreement Is Learnable: Token Teachability in On-Policy Distillation
May 26, 2026
Authors: Yuanyi Wang, Su Lu, Yanggan Gu, Pengkai Wang, Yifan Yang, Zhaoyi Yan, Congkai Xie, Jianmin Wu, Hongxia Yang
cs.AI
Abstract
On-policy distillation (OPD) trains a student on its own rollouts with token-level teacher supervision. Recent selective OPD methods exploit the non-uniformity of OPD signals by prioritizing high-entropy or high-disagreement tokens. We revisit this principle and ask: which token-level teacher signals are actually learnable? Using a fixed-context diagnostic that measures same-context teacher-student KL reduction, we show that raw KL disagreement is a coarse proxy for learning value. It conflates learnable disagreement, where the teacher assigns corrective mass to the student's top-K candidates, with incompatible disagreement, where the teacher places mass mostly off the student's current support. We formalize this local compatibility as token teachability and show that it better predicts fixed-context improvement than raw KL alone. Motivated by this finding, we propose Teachability-Aware OPD (TA-OPD), a lightweight token-position selection method that applies OPD loss to high-teachability positions without reward models or verifiers. Across Qwen2.5 and Qwen3 teacher-student settings, TA-OPD often surpasses full-token OPD with only 5% retained tokens and improves over entropy- and divergence-based baselines. Our results reframe selective OPD as selecting learnable teacher signals rather than merely salient tokens.
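The distinction between learnable and incompatible disagreement can be illustrated with a small sketch. The abstract does not give the formula for token teachability, so the score below is only an assumed stand-in: the KL divergence at a position, weighted by the teacher probability mass that falls on the student's top-K tokens. The KL direction (student relative to teacher), the function names `kl`, `topk_coverage`, and `select_positions`, and the toy distributions are all assumptions made for illustration, not the paper's definitions.

```python
import math

def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def topk_coverage(teacher, student, k):
    """Teacher probability mass on the student's top-k tokens (assumed proxy)."""
    top = sorted(range(len(student)), key=lambda i: student[i], reverse=True)[:k]
    return sum(teacher[i] for i in top)

def select_positions(teacher_probs, student_probs, k=2, keep_frac=0.05):
    """Score each position by KL * coverage and keep the top fraction.

    Hypothetical selection rule: returns (kept position indices, all scores).
    """
    scores = [kl(s, t) * topk_coverage(t, s, k)
              for t, s in zip(teacher_probs, student_probs)]
    n_keep = max(1, math.ceil(keep_frac * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep]), scores

# Toy data: three positions over a four-token vocabulary.
student = [[0.6, 0.3, 0.05, 0.05]] * 3
teacher = [
    [0.3, 0.6, 0.05, 0.05],    # learnable: mass reshuffled within student's top-2
    [0.02, 0.02, 0.36, 0.6],   # incompatible: mass mostly off student's top-2
    [0.6, 0.3, 0.05, 0.05],    # agreement: nothing to learn
]

raw_kl = [kl(s, t) for t, s in zip(teacher, student)]
kept, scores = select_positions(teacher, student, k=2, keep_frac=0.3)
print("raw KL per position:", raw_kl)
print("assumed teachability score per position:", scores)
print("kept positions:", kept)
```

In this toy setup the incompatible position has the largest raw KL, while the coverage-weighted score ranks the learnable position first. That mirrors the abstract's claim that raw KL alone is a coarse proxy, under the assumptions stated above.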