Not All Disagreement Is Learnable: Token Teachability in On-Policy Distillation

Abstract

On-policy distillation (OPD) trains a student on its own rollouts with token-level supervision from a teacher. Recent selective OPD methods exploit the non-uniformity of OPD signals by prioritizing high-entropy or high-disagreement tokens. We revisit this principle and ask: which token-level teacher signals are actually learnable? Using a fixed-context diagnostic that measures the reduction in teacher-student KL within the same context, we show that raw KL disagreement is only a coarse proxy for learning value. It conflates learnable disagreement, where the teacher assigns corrective mass to the student's top-K candidates, with incompatible disagreement, where the teacher places most of its mass outside the student's current support. We formalize this local compatibility as token teachability and show that it predicts fixed-context improvement better than raw KL alone. Motivated by this finding, we propose Teachability-Aware OPD (TA-OPD), a lightweight token-position selection method that applies the OPD loss only to high-teachability positions and requires no reward models or verifiers. Across Qwen2.5 and Qwen3 teacher-student settings, TA-OPD often surpasses full-token OPD while retaining only 5% of tokens, and it improves over entropy- and divergence-based baselines. Our results reframe selective OPD as selecting learnable teacher signals rather than merely salient tokens.
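The distinction between learnable and incompatible disagreement can be illustrated with a small numerical sketch. The abstract does not give the exact teachability formula, so everything below is an assumption made for illustration: the teacher's probability mass on the student's top-K tokens stands in for compatibility, the product of that mass with the per-token KL stands in for a selection score, and the function names `token_kl`, `teacher_mass_on_student_topk`, and `select_positions` are hypothetical. It is not the TA-OPD implementation.

```python
import numpy as np

def token_kl(teacher_p, student_p, eps=1e-12):
    """Per-position KL(teacher || student). The direction is an assumption."""
    log_ratio = np.log(teacher_p + eps) - np.log(student_p + eps)
    return np.sum(teacher_p * log_ratio, axis=-1)

def teacher_mass_on_student_topk(teacher_p, student_p, k):
    """Teacher probability mass that falls on the student's k most likely tokens."""
    # Indices of the k largest student probabilities at each position.
    topk = np.argpartition(-student_p, k - 1, axis=-1)[:, :k]
    return np.take_along_axis(teacher_p, topk, axis=-1).sum(axis=-1)

def select_positions(scores, keep_frac=0.05):
    """Boolean mask keeping the highest-scoring fraction of positions."""
    n_keep = max(1, int(round(keep_frac * scores.shape[0])))
    keep = np.argsort(-scores)[:n_keep]
    mask = np.zeros(scores.shape[0], dtype=bool)
    mask[keep] = True
    return mask

# Toy example: two positions over a 4-token vocabulary, same student distribution.
student = np.array([[0.70, 0.20, 0.05, 0.05],
                    [0.70, 0.20, 0.05, 0.05]])
teacher = np.array([[0.20, 0.70, 0.05, 0.05],   # reorders the student's top-2
                    [0.05, 0.05, 0.20, 0.70]])  # mass mostly off the student's top-2

kl = token_kl(teacher, student)
mass = teacher_mass_on_student_topk(teacher, student, k=2)
score = kl * mass  # assumed proxy score, not the paper's definition

kl_mask = select_positions(kl, keep_frac=0.5)        # selection by raw KL
score_mask = select_positions(score, keep_frac=0.5)  # selection by the proxy score
```

In this toy case the second position has the larger raw KL, so KL-based selection keeps it, even though the teacher's mass there lies almost entirely outside the student's top-2 candidates. Under the assumed proxy score, the first position is kept instead, which mirrors the abstract's claim that raw KL conflates the two kinds of disagreement.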