Not All Disagreement Is Learnable: Token Teachability in On-Policy Distillation
May 26, 2026
Authors: Yuanyi Wang, Su Lu, Yanggan Gu, Pengkai Wang, Yifan Yang, Zhaoyi Yan, Congkai Xie, Jianmin Wu, Hongxia Yang
cs.AI
Abstract
On-policy distillation (OPD) trains a student on its own rollouts with token-level teacher supervision. Recent selective OPD methods exploit the non-uniformity of OPD signals by prioritizing high-entropy or high-disagreement tokens. We revisit this principle and ask: which token-level teacher signals are actually learnable? Using a fixed-context diagnostic that measures same-context teacher-student KL reduction, we show that raw KL disagreement is a coarse proxy for learning value. It conflates learnable disagreement, where the teacher assigns corrective mass to the student's top-K candidates, with incompatible disagreement, where the teacher places mass mostly off the student's current support. We formalize this local compatibility as token teachability and show that it better predicts fixed-context improvement than raw KL alone. Motivated by this finding, we propose Teachability-Aware OPD (TA-OPD), a lightweight token-position selection method that applies OPD loss to high-teachability positions without reward models or verifiers. Across Qwen2.5 and Qwen3 teacher-student settings, TA-OPD often surpasses full-token OPD with only 5% retained tokens and improves over entropy- and divergence-based baselines. Our results reframe selective OPD as selecting learnable teacher signals rather than merely salient tokens.
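The distinction between learnable and incompatible disagreement can be illustrated with a small sketch. The abstract does not give the exact teachability formula, so the score used here (teacher-student KL multiplied by the teacher's probability mass on the student's top-K tokens) is an assumption made for illustration, as are the KL direction, the function names, and the toy logits. It is not the authors' definition of TA-OPD.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) over a shared vocabulary. The direction is an assumption."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def teacher_mass_on_student_topk(teacher_probs, student_probs, k):
    """Teacher probability mass that falls on the student's top-k tokens."""
    topk = sorted(range(len(student_probs)),
                  key=lambda i: student_probs[i], reverse=True)[:k]
    return sum(teacher_probs[i] for i in topk)

def teachability_scores(teacher_logits, student_logits, k=2):
    """Hypothetical score: raw KL weighted by teacher mass on student top-k."""
    scores = []
    for t_logits, s_logits in zip(teacher_logits, student_logits):
        t, s = softmax(t_logits), softmax(s_logits)
        scores.append(kl_divergence(t, s) * teacher_mass_on_student_topk(t, s, k))
    return scores

def select_positions(scores, keep_ratio=0.05):
    """Indices of the highest-scoring positions (at least one is kept)."""
    n_keep = max(1, math.ceil(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])

# Toy rollout with three positions and a four-token vocabulary.
student_logits = [
    [2.0, 1.0, -2.0, -2.0],   # position 0: teacher agrees with student
    [2.0, 1.0, -2.0, -2.0],   # position 1: teacher prefers student's 2nd choice
    [3.0, 2.0, -3.0, -3.0],   # position 2: teacher prefers off-support tokens
]
teacher_logits = [
    [2.0, 1.0, -2.0, -2.0],
    [1.0, 3.0, -2.0, -2.0],
    [-3.0, -3.0, 3.0, 2.0],
]

raw_kl = [kl_divergence(softmax(t), softmax(s))
          for t, s in zip(teacher_logits, student_logits)]
scores = teachability_scores(teacher_logits, student_logits, k=2)
selected = select_positions(scores, keep_ratio=0.05)
```

In this toy case, raw KL is largest at position 2, where the teacher puts nearly all of its mass outside the student's top-2 tokens, while the assumed teachability score is largest at position 1, where the teacher redistributes mass among tokens the student already considers likely. A selector based on raw KL and one based on the assumed score would therefore retain different positions.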