OPRD: On-Policy Representation Distillation

Abstract

On-policy distillation (OPD) supervises the student only in output space, by matching next-token probabilities. This output-only paradigm has two limitations: (1) the sampling variance of Monte Carlo KL estimates over large vocabularies (e.g., Qwen's ~150k tokens) persists throughout training, and (2) it treats the teacher as a black box, discarding all intermediate hidden states once the LM head has produced its output. We propose On-Policy Representation Distillation (OPRD), which lifts distillation into hidden-state space by aligning student and teacher representations across selected layers on the same rollouts, bypassing the LM head entirely. Theoretically, OPRD eliminates sampling variance and provides richer per-layer structural information. Empirically, OPRD closes the student-teacher gap on AIME 2024/2025 and AIMO, whereas output-space OPD baselines plateau below the teacher. OPRD also trains 1.44× faster and uses 54% less memory than top-k OPD. Code: https://github.com/ShenzhiYang2000/OPRD.
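
The contrast between the two supervision signals can be illustrated with a toy sketch. It is not the authors' implementation (see the repository above for that), and several things in it are assumptions made purely for illustration: the alignment loss is taken to be a plain mean squared error, student and teacher hidden states are assumed to have the same width, the vocabulary has 1,000 entries, and all function names are hypothetical. The sketch compares a single-sample Monte Carlo estimate of KL(student || teacher), which changes with the sampled token, against a hidden-state loss that is a deterministic function of the rollout.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def exact_kl(p, q):
    """Exact KL(p || q), summed over the whole vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def sampled_kl(p, q, rng):
    """Single-sample Monte Carlo estimate of KL(p || q): draw one token from p."""
    token = rng.choices(range(len(p)), weights=p, k=1)[0]
    return math.log(p[token] / q[token])


def hidden_state_loss(student_states, teacher_states):
    """Mean squared error over the selected layers (assumed loss, equal widths)."""
    total, count = 0.0, 0
    for s_layer, t_layer in zip(student_states, teacher_states):
        for s, t in zip(s_layer, t_layer):
            total += (s - t) ** 2
            count += 1
    return total / count


rng = random.Random(0)
vocab_size = 1000  # toy size; the abstract cites ~150k tokens for Qwen

# Output-space signal: on-policy, so tokens are sampled from the student.
student_probs = softmax([rng.gauss(0, 1) for _ in range(vocab_size)])
teacher_probs = softmax([rng.gauss(0, 1) for _ in range(vocab_size)])
estimates = [sampled_kl(student_probs, teacher_probs, rng) for _ in range(5)]
print("exact KL:         ", exact_kl(student_probs, teacher_probs))
print("sampled estimates:", estimates)

# Hidden-state signal: two selected layers, hidden width 4, no sampling involved.
student_hidden = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(2)]
teacher_hidden = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(2)]
print("hidden-state loss:", hidden_state_loss(student_hidden, teacher_hidden))
```

Re-running the sampled estimator with a different seed gives different values for the same pair of distributions, whereas the hidden-state loss depends only on the states themselves. A real setup would typically also need a projection between student and teacher widths when they differ, which this sketch leaves out.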