Not All Disagreements Are Learnable: Token Teachability in On-Policy Distillation

Abstract

On-policy distillation (OPD) trains a student model on its own rollouts with token-level teacher supervision. Recent selective OPD methods exploit the non-uniformity of OPD signals by prioritizing high-entropy or high-disagreement tokens. We revisit this principle and ask: which token-level teacher signals are actually learnable? Using a fixed-context diagnostic that measures the reduction in teacher-student KL divergence on the same context, we show that raw KL disagreement is only a coarse proxy for learning value. It conflates learnable disagreement, where the teacher assigns corrective mass to the student's current top-K candidates, with incompatible disagreement, where the teacher places its mass mostly outside the student's current support. We formalize this local compatibility as token teachability and show that it predicts fixed-context improvement better than raw KL alone. Motivated by this finding, we propose Teachability-Aware OPD (TA-OPD), a lightweight token-position selection method that applies the OPD loss to high-teachability positions without reward models or verifiers. Across Qwen2.5 and Qwen3 teacher-student settings, TA-OPD often surpasses full-token OPD while retaining only 5% of tokens, and it improves over entropy- and divergence-based baselines. Our results reframe selective OPD as selecting learnable teacher signals rather than merely salient tokens.
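The distinction between learnable and incompatible disagreement can be illustrated with a small sketch. The abstract does not give the exact definition of token teachability, so the score below is only an assumed proxy: the teacher-student KL at a position, weighted by the fraction of teacher probability that falls on the student's top-K tokens. The function names (`teacher_mass_on_student_topk`, `teachability_score`, `select_positions`), the KL direction, the multiplicative weighting, and the parameters `k` and `keep_frac` are all hypothetical choices made for illustration, not the paper's method.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) over a shared vocabulary. The direction is an assumption."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def teacher_mass_on_student_topk(teacher, student, k):
    """Fraction of teacher probability placed on the student's top-k tokens."""
    topk = sorted(range(len(student)), key=lambda i: student[i], reverse=True)[:k]
    return sum(teacher[i] for i in topk)

def teachability_score(teacher, student, k):
    """Hypothetical proxy: disagreement weighted by on-support teacher mass."""
    return kl_divergence(teacher, student) * teacher_mass_on_student_topk(teacher, student, k)

def select_positions(teacher_dists, student_dists, k=2, keep_frac=0.05):
    """Keep the highest-scoring fraction of positions (at least one)."""
    scores = [teachability_score(t, s, k) for t, s in zip(teacher_dists, student_dists)]
    n_keep = max(1, math.ceil(keep_frac * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])

# Toy distributions over a 4-token vocabulary (made-up numbers).
student = [0.60, 0.30, 0.05, 0.05]
learnable = [0.20, 0.70, 0.05, 0.05]     # teacher re-ranks the student's top-2
incompatible = [0.02, 0.03, 0.05, 0.90]  # teacher mass sits off the student's top-2

# Raw KL ranks the incompatible position higher; the proxy score reverses that.
print(kl_divergence(learnable, student), kl_divergence(incompatible, student))
print(teachability_score(learnable, student, 2), teachability_score(incompatible, student, 2))
print(select_positions([learnable, incompatible, student], [student] * 3))
```

In this toy case the incompatible position has the larger raw KL, yet almost none of the teacher's mass lands on tokens the student currently considers, so the proxy ranks the learnable position first. Selecting positions by a score of this kind requires only the two next-token distributions, which is consistent with the abstract's claim that no reward model or verifier is needed.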