Not All Disagreements Are Learnable: Token Teachability in On-Policy Distillation

Abstract

On-policy distillation (OPD) trains a student on its own rollouts with token-level teacher supervision. Recent selective OPD methods exploit the non-uniformity of OPD signals by prioritizing high-entropy or high-disagreement tokens. We revisit this principle and ask: which token-level teacher signals are actually learnable? Using a fixed-context diagnostic that measures the reduction in teacher-student KL at the same context, we show that raw KL disagreement is only a coarse proxy for learning value. It conflates learnable disagreement, where the teacher assigns corrective mass to the student's top-K candidates, with incompatible disagreement, where the teacher places most of its mass outside the student's current support. We formalize this local compatibility as token teachability and show that it predicts fixed-context improvement better than raw KL alone. Motivated by this finding, we propose Teachability-Aware OPD (TA-OPD), a lightweight token-position selection method that applies the OPD loss to high-teachability positions without reward models or verifiers. Across Qwen2.5 and Qwen3 teacher-student settings, TA-OPD often surpasses full-token OPD while retaining only 5% of tokens, and it improves over entropy- and divergence-based baselines. Our results reframe selective OPD as selecting learnable teacher signals rather than merely salient tokens.
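To make the distinction between learnable and incompatible disagreement concrete, the following is a minimal sketch, not the paper's definition. The abstract does not give the teachability formula, so the score used here (KL weighted by the teacher's probability mass on the student's top-K tokens), the KL direction, and all function names are assumptions chosen purely for illustration.

```python
import math

def kl(p, q, eps=1e-12):
    """KL(p || q) over a shared vocabulary. The direction is an assumption."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def teacher_mass_on_student_topk(teacher, student, k):
    """Teacher probability that lands on the student's k most likely tokens."""
    topk = sorted(range(len(student)), key=lambda i: student[i], reverse=True)[:k]
    return sum(teacher[i] for i in topk)

def teachability(teacher, student, k):
    """Hypothetical score: raw disagreement scaled by local compatibility."""
    return kl(teacher, student) * teacher_mass_on_student_topk(teacher, student, k)

def select_positions(teacher_dists, student_dists, k=2, keep_ratio=0.05):
    """Keep the highest-scoring fraction of positions (at least one)."""
    scores = [teachability(t, s, k) for t, s in zip(teacher_dists, student_dists)]
    n_keep = max(1, math.ceil(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep]), scores

# Toy next-token distributions over a 4-token vocabulary (made-up numbers).
student = [0.6, 0.3, 0.05, 0.05]
learnable = [0.2, 0.7, 0.05, 0.05]      # teacher re-ranks the student's top-2
incompatible = [0.02, 0.03, 0.05, 0.9]  # teacher mass is mostly off-support

for name, teacher in [("learnable", learnable), ("incompatible", incompatible)]:
    print(name, "raw KL:", round(kl(teacher, student), 3),
          "score:", round(teachability(teacher, student, 2), 3))

kept, _ = select_positions([learnable, incompatible, student], [student] * 3)
print("positions kept:", kept)
```

In this toy setting the incompatible position has the larger raw KL, yet the compatibility-weighted score ranks the learnable position first, which is the kind of reordering the abstract argues for. The OPD loss would then be applied only at the kept positions; the 5% default for `keep_ratio` mirrors the retention rate quoted above, while `k=2` is arbitrary.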