Adaptive Teacher Exposure for Self-Distillation in LLM Reasoning

Abstract

On-policy self-distillation has become a strong recipe for LLM reasoning: a privileged teacher, conditioned on the reference solution, supervises the student's own rollouts. A design choice shared by nearly all such methods, however, has gone unquestioned: the teacher always sees the full reference reasoning. We argue that this default is itself part of the problem and identify a teacher-side exposure mismatch: when the teacher conditions on reasoning far beyond the student's current competence, the resulting token targets become too strong for the student to absorb. A controlled fixed-exposure sweep makes this concrete on two fronts: (1) full exposure is not reliably the best choice, and (2) student-teacher mismatch grows monotonically as the teacher sees more privileged reasoning. This motivates treating teacher exposure not as a fixed hyperparameter but as a learnable training-time control variable. We therefore propose Adaptive Teacher Exposure for Self-Distillation (ATESD). ATESD models the reveal ratio with a lightweight Beta-policy controller conditioned on compact training-state statistics, and applies one sampled exposure across a short hold window of student updates. To make the exposure controller learnable, we optimize it with a discounted learning-progress reward that scores each held decision by its effect on the student's future improvement rather than by its immediate loss change, which addresses the delayed credit assignment induced by on-policy distillation. Experiments on AIME 24, AIME 25, and HMMT 25 across Qwen3-{1.7B, 4B, 8B} show that ATESD consistently outperforms competitive self-distillation and RL baselines, improving over OPSD by +0.95, +2.05, and +2.33 Average@12 points, respectively, and establishing adaptive teacher exposure as an effective new axis for reasoning self-distillation.
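
The control loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: the controller is a linear map from state statistics to Beta parameters, the reveal ratio selects a prefix of the reference reasoning, and the reward is a discounted sum of per-window progress values. The names `BetaExposureController`, `reveal_prefix`, and `discounted_progress` are hypothetical.

```python
import math
import random


def softplus(x):
    # Numerically stable softplus; always positive.
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)


class BetaExposureController:
    """Hypothetical controller: maps training-state statistics to Beta(alpha, beta)."""

    def __init__(self, n_features, seed=0):
        # Assumption: a linear parameterization; the abstract only says "lightweight".
        self.w_alpha = [0.0] * n_features
        self.w_beta = [0.0] * n_features
        self.rng = random.Random(seed)

    def params(self, state):
        # Adding 1.0 keeps both parameters above 1, so the Beta density is unimodal.
        alpha = 1.0 + softplus(sum(w * s for w, s in zip(self.w_alpha, state)))
        beta = 1.0 + softplus(sum(w * s for w, s in zip(self.w_beta, state)))
        return alpha, beta

    def sample_reveal_ratio(self, state):
        # One sample is drawn per hold window and reused for every update in it.
        alpha, beta = self.params(state)
        return self.rng.betavariate(alpha, beta)


def reveal_prefix(reference_tokens, ratio):
    # Assumption: "exposure" means showing the teacher a prefix of the reference.
    return reference_tokens[: int(ratio * len(reference_tokens))]


def discounted_progress(progress, gamma=0.9):
    # progress[k] is the student's improvement measured k hold windows after
    # the decision; later windows are down-weighted by gamma**k.
    return sum((gamma ** k) * p for k, p in enumerate(progress))


controller = BetaExposureController(n_features=3)
ratio = controller.sample_reveal_ratio([0.4, 0.1, 0.7])
teacher_context = reveal_prefix(list(range(10)), ratio)
reward = discounted_progress([0.2, 0.1], gamma=0.5)
```

In this sketch, a single call to `sample_reveal_ratio` fixes the exposure for a whole hold window, and `discounted_progress` credits that decision with improvement observed in later windows instead of only the immediate one, which is the delayed-credit idea the abstract describes. How the controller's weights are updated from this reward is not specified in the abstract and is left out here.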