Trajectory-Refined Distillation

Abstract

On-policy distillation (OPD) has become a central post-training tool for large language models (LLMs), providing dense per-token teacher supervision along the student's own rollouts. In this work, we identify a common structural cause of OPD's failures, which we call prefix failure. Under prefix failure, dense per-token supervision induces a bimodal teacher mixture and fragmented gradients that token-level loss truncation or reweighting cannot address. This observation motivates moving beyond token-level loss interventions toward trajectory-level output corrections. We therefore propose Trajectory-Refined Distillation (TRD), a trajectory-level correction method that revises the student's rollout under teacher guidance while staying within on-policy support. By correcting problematic prefixes before distillation, TRD mitigates prefix failure at its source. Moreover, TRD improves exploration by exposing the student to alternative valid derivations under teacher guidance, even when the original rollouts are already correct. TRD can also be applied to on-policy self-distillation (OPSD), a parameter-sharing variant that uses the student model conditioned on privileged information as the teacher. Across a wide range of benchmarks and base models at multiple scales, TRD consistently outperforms prior baselines, improving single-attempt accuracy and broadening reasoning coverage. Code is available at https://github.com/louieworth/trd
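
To make the notion of dense per-token teacher supervision concrete, the following is a minimal sketch of a token-level distillation loss computed along a student rollout. It is an illustration under stated assumptions, not the objective used in this work: the choice of reverse KL divergence, the function names `log_softmax` and `per_token_reverse_kl`, and the toy three-token vocabulary are all hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def per_token_reverse_kl(student_logits, teacher_logits):
    """Return KL(student || teacher) at each position of a student rollout.

    Both arguments are lists of per-position logit vectors over the same
    vocabulary, evaluated on the prefix that the student itself sampled.
    Reverse KL is an assumed choice here; other divergences are possible.
    """
    losses = []
    for s_row, t_row in zip(student_logits, teacher_logits):
        s_logp = log_softmax(s_row)
        t_logp = log_softmax(t_row)
        # Expectation under the student distribution of the log-ratio.
        kl = sum(math.exp(ls) * (ls - lt) for ls, lt in zip(s_logp, t_logp))
        losses.append(kl)
    return losses


# Toy rollout with two positions and a three-token vocabulary (made-up numbers).
student = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
teacher = [[2.0, 0.0, 0.0], [0.0, 0.0, 3.0]]

token_losses = per_token_reverse_kl(student, teacher)
opd_loss = sum(token_losses) / len(token_losses)  # mean over positions
```

In this toy example the teacher agrees with the student at the first position and disagrees at the second, so only the second position contributes to the loss. Every position receives its own supervision signal, which is the dense per-token setting in which the abstract locates prefix failure.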