Adaptive Teacher Exposure for Self-Distillation in LLM Reasoning

Abstract

On-policy self-distillation has become a strong recipe for LLM reasoning: a privileged teacher supervises the student's own rollouts while conditioning on the reference solution. A design choice shared by nearly all such methods, however, has gone unquestioned: the teacher always sees the full reference reasoning. We argue that this default is itself part of the problem and identify a teacher-side exposure mismatch: when the teacher conditions on reasoning far beyond the student's current competence, the resulting token targets become too strong for the student to absorb. A controlled fixed-exposure sweep makes this concrete on two fronts: (1) full exposure is not reliably the best choice, and (2) student-teacher mismatch grows monotonically as the teacher sees more privileged reasoning. This motivates treating teacher exposure not as a fixed hyperparameter but as a learnable training-time control variable. We therefore propose Adaptive Teacher Exposure for Self-Distillation (ATESD). ATESD models the reveal ratio with a lightweight Beta-policy controller conditioned on compact training-state statistics, and holds one sampled exposure fixed for a short window of student updates. To make the exposure controller learnable, we optimize it with a discounted learning-progress reward that scores each held decision by its effect on the student's future improvement rather than by its immediate loss change, addressing the delayed credit assignment induced by on-policy distillation. Experiments on AIME 24, AIME 25, and HMMT 25 across Qwen3-{1.7B, 4B, 8B} show that ATESD consistently outperforms competitive self-distillation and RL baselines, improving over OPSD by +0.95, +2.05, and +2.33 Average@12 points, respectively, and establishing adaptive teacher exposure as an effective new axis for reasoning self-distillation.
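
The control loop described above can be illustrated with a minimal sketch. The abstract does not specify the state features, the controller architecture, or the exact form of the reward, so everything below is an assumption made for illustration: the linear-plus-softplus parameterization of the Beta distribution, the prefix-style truncation of the reference reasoning, and the reward defined as a discounted sum of per-window loss decreases. All function names (`reveal_prefix`, `beta_params`, `sample_reveal_ratio`, `discounted_progress_rewards`) are hypothetical and are not taken from the paper.

```python
import math
import random
from typing import List, Sequence, Tuple


def reveal_prefix(reference_tokens: Sequence[str], ratio: float) -> List[str]:
    """Return the leading fraction of the reference reasoning shown to the teacher.

    Assumption: exposure is a prefix of the reference; the paper may truncate differently.
    """
    ratio = min(max(ratio, 0.0), 1.0)  # clamp to [0, 1]
    return list(reference_tokens[: int(ratio * len(reference_tokens))])


def _softplus(x: float) -> float:
    # Numerically stable softplus: log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def beta_params(
    state: Sequence[float],
    w_alpha: Sequence[float],
    w_beta: Sequence[float],
) -> Tuple[float, float]:
    """Map training-state statistics to Beta(alpha, beta) parameters.

    Assumption: a single linear layer per parameter, followed by softplus.
    Adding 1.0 keeps both parameters above 1, so the density is unimodal.
    """
    alpha = 1.0 + _softplus(sum(w * s for w, s in zip(w_alpha, state)))
    beta = 1.0 + _softplus(sum(w * s for w, s in zip(w_beta, state)))
    return alpha, beta


def sample_reveal_ratio(alpha: float, beta: float, rng: random.Random) -> float:
    """Sample one reveal ratio in [0, 1]; it is then held for a window of updates."""
    return rng.betavariate(alpha, beta)


def discounted_progress_rewards(
    window_losses: Sequence[float], gamma: float
) -> List[float]:
    """Score each held decision by discounted future improvement.

    window_losses[k] is the student's loss measured at the start of hold window k;
    the final entry is the loss after the last window. The reward for decision k is
    sum_j gamma**j * (window_losses[k + j] - window_losses[k + j + 1]).
    """
    progress = [
        window_losses[i] - window_losses[i + 1]
        for i in range(len(window_losses) - 1)
    ]
    rewards = [0.0] * len(progress)
    running = 0.0
    for k in reversed(range(len(progress))):
        running = progress[k] + gamma * running
        rewards[k] = running
    return rewards
```

In this sketch, each hold window would call `beta_params` on the current statistics, draw one ratio with `sample_reveal_ratio`, and build the teacher context with `reveal_prefix`. Once later windows have been observed, `discounted_progress_rewards` provides the per-decision signal that a policy-gradient update of the controller weights could use. Crediting a decision with improvement in subsequent windows, rather than only its own, is one way to reflect the delayed credit assignment mentioned above.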