Adaptive Teacher Exposure for Self-Distillation in LLM Reasoning

Abstract

On-policy self-distillation has become a strong recipe for LLM reasoning: a privileged teacher, conditioned on the reference solution, supervises the student's own rollouts. Nearly all such methods share a design choice that has gone unquestioned, namely that the teacher always sees the full reference reasoning. We argue that this default is itself part of the problem and identify a teacher-side exposure mismatch: when the teacher conditions on reasoning far beyond the student's current competence, the resulting token targets become too strong for the student to absorb. A controlled fixed-exposure sweep makes this concrete on two fronts: 1) full exposure is not reliably the best choice, and 2) student-teacher mismatch grows monotonically as the teacher sees more privileged reasoning. This motivates treating teacher exposure not as a fixed hyperparameter but as a learnable training-time control variable. We therefore propose Adaptive Teacher Exposure for Self-Distillation (ATESD). ATESD models the reveal ratio with a lightweight Beta-policy controller conditioned on compact training-state statistics, and holds each sampled exposure fixed for a short window of student updates. To make this exposure controller learnable, we optimize it with a discounted learning-progress reward that scores each held decision by its effect on the student's future improvement rather than by the immediate change in loss, which addresses the delayed credit assignment induced by on-policy distillation. Experiments on AIME 24, AIME 25, and HMMT 25 across Qwen3-{1.7B, 4B, 8B} show that ATESD consistently outperforms competitive self-distillation and RL baselines, improving over OPSD by +0.95, +2.05, and +2.33 Average@12 points respectively, and establish adaptive teacher exposure as an effective new axis for reasoning self-distillation.
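The three mechanisms named in the abstract (a Beta-distributed reveal ratio, a hold window, and a discounted learning-progress reward) can be illustrated with a minimal Python sketch. This is not the authors' implementation. All function names, the fixed `alpha`/`beta` values, the step-level truncation of the reference, and the exact reward form are assumptions made for illustration. In particular, ATESD conditions the Beta parameters on training-state statistics, whereas this sketch keeps them constant.

```python
import random


def reveal_prefix(reference_steps, ratio):
    """Return the leading fraction `ratio` of the reference reasoning steps.

    Assumption: exposure is applied by truncating the reference at step level.
    """
    k = int(ratio * len(reference_steps))  # truncates toward zero
    return reference_steps[:k]


def exposure_schedule(num_updates, hold, alpha, beta, seed=0):
    """Sample one reveal ratio from Beta(alpha, beta) per hold window.

    Each sampled ratio is repeated for `hold` consecutive student updates.
    Assumption: alpha and beta are constants here; a learned controller
    would instead predict them from training-state statistics.
    """
    rng = random.Random(seed)
    ratios = []
    while len(ratios) < num_updates:
        ratio = rng.betavariate(alpha, beta)
        ratios.extend([ratio] * hold)
    return ratios[:num_updates]


def discounted_progress(losses, gamma=0.9):
    """Discounted sum of loss decreases over the windows after a decision.

    losses[0] is the student loss before the held window and losses[i] is
    the loss after the i-th window. Assumption: learning progress is
    measured as a decrease in some held-out student loss.
    """
    return sum(
        (gamma ** i) * (losses[i] - losses[i + 1])
        for i in range(len(losses) - 1)
    )


if __name__ == "__main__":
    steps = ["step 1", "step 2", "step 3", "step 4"]
    schedule = exposure_schedule(num_updates=6, hold=3, alpha=2.0, beta=2.0)
    for update, ratio in enumerate(schedule):
        visible = reveal_prefix(steps, ratio)
        print(update, round(ratio, 3), len(visible))
    print(discounted_progress([1.0, 0.8, 0.7], gamma=0.5))
```

The reward looks past the first window on purpose: a decision whose benefit only shows up several updates later still receives credit, discounted by `gamma`.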