Not All Disagreement Is Learnable: Token Teachability in On-Policy Distillation

Abstract

On-policy distillation (OPD) trains a student on its own rollouts with token-level teacher supervision. Recent selective OPD methods exploit the non-uniformity of OPD signals by prioritizing high-entropy or high-disagreement tokens. We revisit this principle and ask: which token-level teacher signals are actually learnable? Using a fixed-context diagnostic that measures the reduction in teacher-student KL within the same context, we show that raw KL disagreement is only a coarse proxy for learning value. It conflates learnable disagreement, where the teacher assigns corrective mass to the student's top-K candidates, with incompatible disagreement, where the teacher places most of its mass outside the student's current support. We formalize this local compatibility as token teachability and show that it predicts fixed-context improvement better than raw KL alone. Motivated by this finding, we propose Teachability-Aware OPD (TA-OPD), a lightweight token-position selection method that applies the OPD loss only at high-teachability positions, without reward models or verifiers. Across Qwen2.5 and Qwen 3 teacher-student settings, TA-OPD often surpasses full-token OPD while retaining only 5% of tokens, and it improves over entropy- and divergence-based baselines. Our results reframe selective OPD as selecting learnable teacher signals rather than merely salient tokens.
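
The distinction between learnable and incompatible disagreement can be illustrated with a small sketch. The abstract does not give the exact definition of token teachability, so everything here is an assumption made for illustration: the KL direction (teacher to student), the function names, and in particular `teachability_proxy`, which simply weights raw KL by the teacher mass that falls on the student's top-K tokens. It is not the paper's formula, only one way to show why raw KL and a support-aware score can rank the same positions differently.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def topk_support_mass(teacher, student, k):
    """Teacher probability mass that falls on the student's top-k tokens."""
    top = sorted(range(len(student)), key=lambda i: student[i], reverse=True)[:k]
    return sum(teacher[i] for i in top)

def teachability_proxy(teacher, student, k):
    """Hypothetical score (an assumption, not the paper's definition):
    raw KL disagreement weighted by on-support teacher mass."""
    return kl_divergence(teacher, student) * topk_support_mass(teacher, student, k)

def select_positions(scores, keep_fraction=0.05):
    """Indices of the highest-scoring positions, keeping at least one."""
    n_keep = max(1, int(round(keep_fraction * len(scores))))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])

# Toy vocabulary of 4 tokens at a single position.
student = [0.6, 0.3, 0.05, 0.05]
learnable_teacher = [0.2, 0.7, 0.05, 0.05]      # corrective mass inside the student's top-2
incompatible_teacher = [0.05, 0.05, 0.1, 0.8]   # mass mostly outside the student's top-2

for name, teacher in [("learnable", learnable_teacher),
                      ("incompatible", incompatible_teacher)]:
    print(name,
          "raw KL:", round(kl_divergence(teacher, student), 3),
          "proxy:", round(teachability_proxy(teacher, student, k=2), 3))
```

In this toy case the incompatible teacher produces the larger raw KL, while the support-weighted proxy ranks the learnable case higher. `select_positions` then mirrors the position-selection step in spirit: given one score per token position, it keeps only the top fraction (5% by default) for the distillation loss.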