Adaptive Teacher Exposure for Self-Distillation in Large Language Model Reasoning

Abstract

On-policy self-distillation has become a strong recipe for LLM reasoning: a privileged teacher supervises the student's own rollouts while conditioning on the reference solution. A design choice shared by nearly all such methods, however, has gone unquestioned: the teacher always sees the full reference reasoning. We argue that this default is itself part of the problem and identify a teacher-side exposure mismatch: when the teacher conditions on reasoning far beyond the student's current competence, the resulting token targets become too strong for the student to absorb. A controlled fixed-exposure sweep makes this concrete on two fronts: 1) full exposure is not reliably the best choice, and 2) student-teacher mismatch grows monotonically as the teacher sees more privileged reasoning. This motivates treating teacher exposure not as a fixed hyperparameter but as a learnable training-time control variable. We therefore propose Adaptive Teacher Exposure for Self-Distillation (ATESD). ATESD models the reveal ratio with a lightweight Beta-policy controller conditioned on compact training-state statistics, and holds each sampled exposure fixed for a short window of student updates. To make this exposure controller learnable, we optimize it with a discounted learning-progress reward that scores each held decision by its effect on the student's future improvement rather than by its immediate loss change, addressing the delayed credit assignment induced by on-policy distillation. Experiments on AIME 24, AIME 25, and HMMT 25 across Qwen3-{1.7B, 4B, 8B} show that ATESD consistently outperforms competitive self-distillation and RL baselines, improving over OPSD by +0.95, +2.05, and +2.33 Average@12 points, respectively, and establishing adaptive teacher exposure as an effective new axis for reasoning self-distillation.
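The controller loop described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how a reveal ratio is applied to the reference reasoning or how the learning-progress reward is computed, so the prefix-truncation rule, the per-window loss differences, and the names `sample_exposure`, `reveal_prefix`, and `discounted_progress_rewards` are all assumptions made for illustration.

```python
import random


def sample_exposure(alpha, beta, rng=random):
    """Draw a reveal ratio in [0, 1] from a Beta(alpha, beta) policy.

    In ATESD the Beta parameters would come from a controller conditioned on
    training-state statistics; here they are plain inputs (an assumption).
    """
    return rng.betavariate(alpha, beta)


def reveal_prefix(reference_tokens, reveal_ratio):
    """Show the teacher only a leading fraction of the reference reasoning.

    Prefix truncation is an assumed way to realize partial exposure.
    """
    if not 0.0 <= reveal_ratio <= 1.0:
        raise ValueError("reveal_ratio must be in [0, 1]")
    keep = int(reveal_ratio * len(reference_tokens))
    return reference_tokens[:keep]


def discounted_progress_rewards(window_losses, gamma=0.9):
    """Score each held exposure decision by discounted future improvement.

    window_losses[k] is a student loss measured before hold window k, with one
    final entry measured after the last window. The reward for decision k is
    sum_j gamma**j * (loss[k + j] - loss[k + j + 1]), an assumed form of the
    "discounted learning-progress reward".
    """
    progress = [
        window_losses[k] - window_losses[k + 1]
        for k in range(len(window_losses) - 1)
    ]
    rewards = [0.0] * len(progress)
    running = 0.0
    # Accumulate from the last window backwards so later gains are discounted.
    for k in reversed(range(len(progress))):
        running = progress[k] + gamma * running
        rewards[k] = running
    return rewards
```

In this sketch, a decision followed by a temporary loss increase can still receive a positive reward if later windows improve, which is the delayed-credit behavior the abstract attributes to the discounted reward.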