OPRD: On-Policy Representation Distillation

Abstract

On-policy distillation (OPD) supervises the student only in output space, by matching next-token probabilities. This output-only paradigm has two limitations: (1) the sampling variance of Monte Carlo KL estimates over large vocabularies (e.g., Qwen's ~150k tokens) persists throughout training, and (2) it treats the teacher as a black box, discarding all the intermediate hidden states that precede the LM head. We propose On-Policy Representation Distillation (OPRD), which lifts distillation into hidden-state space by aligning student and teacher representations at selected layers on the same rollouts, bypassing the LM head entirely. Theoretically, OPRD eliminates this sampling variance and provides richer per-layer structural information. Empirically, OPRD closes the student-teacher gap on AIME 2024/2025 and AIMO, whereas output-space OPD baselines plateau below the teacher. OPRD also trains 1.44x faster and uses 54% less memory than top-k OPD. Code: https://github.com/ShenzhiYang2000/OPRD.
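
The contrast between the two kinds of supervision can be illustrated with a toy sketch. It is not taken from the OPRD repository: the choice of reverse KL for the output-space objective, the mean squared error for the representation objective, the equal hidden width of student and teacher, and all function names are assumptions made for illustration only.

```python
import math
import random


def exact_reverse_kl(student, teacher):
    """KL(student || teacher), summed over the full vocabulary."""
    return sum(s * math.log(s / t) for s, t in zip(student, teacher) if s > 0)


def sampled_reverse_kl(student, teacher, n_samples, seed):
    """Monte Carlo estimate: mean log-ratio at tokens drawn from the student."""
    rng = random.Random(seed)
    tokens = rng.choices(range(len(student)), weights=student, k=n_samples)
    return sum(math.log(student[i] / teacher[i]) for i in tokens) / n_samples


def alignment_loss(student_hidden, teacher_hidden, layers):
    """Mean squared error between hidden vectors at the selected layers.

    Assumption: both models expose vectors of equal width at each layer.
    Models of different width would need a projection, omitted here.
    """
    total, count = 0.0, 0
    for layer in layers:
        for s, t in zip(student_hidden[layer], teacher_hidden[layer]):
            total += (s - t) ** 2
            count += 1
    return total / count


if __name__ == "__main__":
    # Toy 3-token vocabulary (hypothetical numbers).
    student = [0.7, 0.2, 0.1]
    teacher = [0.5, 0.3, 0.2]
    print("exact KL:", exact_reverse_kl(student, teacher))
    # Single-sample estimates change with the drawn token.
    for seed in range(3):
        print("1-sample estimate:", sampled_reverse_kl(student, teacher, 1, seed))

    # Toy hidden states for one token, keyed by layer index (hypothetical).
    student_hidden = {4: [0.1, -0.3], 8: [0.5, 0.2]}
    teacher_hidden = {4: [0.0, -0.1], 8: [0.4, 0.6]}
    # No token is sampled here: the loss is a fixed function of the rollout.
    print("alignment loss:", alignment_loss(student_hidden, teacher_hidden, [4, 8]))
```

In this sketch the sampled KL estimate depends on which tokens happen to be drawn, while the alignment loss is a deterministic function of the hidden states for a given rollout, which is the sense in which the representation objective avoids vocabulary sampling noise.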