OPRD: On-Policy Representation Distillation

Abstract

On-policy distillation (OPD) supervises the student only in output space by matching next-token probabilities. This output-only paradigm has two limitations: (1) the sampling variance of Monte Carlo KL estimates over large vocabularies (e.g., Qwen's roughly 150k tokens) persists throughout training, and (2) it treats the teacher as a black box, discarding all intermediate hidden states that precede the LM head. We propose On-Policy Representation Distillation (OPRD), which lifts distillation into hidden-state space by aligning student and teacher representations at selected layers on the same rollouts, bypassing the LM head entirely. Theoretically, OPRD eliminates sampling variance and provides richer per-layer structural information. Empirically, OPRD closes the student-teacher gap on AIME 2024/2025 and AIMO, whereas output-space OPD baselines plateau below the teacher. OPRD also trains 1.44x faster and uses 54% less memory than top-k OPD. Code: https://github.com/ShenzhiYang2000/OPRD.
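
The contrast between a sampled output-space signal and a deterministic hidden-state signal can be illustrated with a toy sketch. This is not the OPRD implementation. It assumes a reverse-KL objective estimated from tokens sampled from the student, a plain mean squared error as the alignment loss, and equal student and teacher hidden sizes; the abstract specifies none of these. All names (`mc_kl_estimate`, `representation_loss`, `selected_layers`) and the toy sizes are hypothetical.

```python
import math
import random
import statistics


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def exact_kl(p, q):
    """Exact KL(p || q): a full sum over the vocabulary, no sampling."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def mc_kl_estimate(p, q, rng, num_samples=1):
    """Monte Carlo estimate of KL(p || q) from tokens sampled from p."""
    tokens = rng.choices(range(len(p)), weights=p, k=num_samples)
    return sum(math.log(p[t] / q[t]) for t in tokens) / num_samples


def representation_loss(student_states, teacher_states, layers):
    """Mean squared error between hidden states at the selected layers.

    Assumption: equal hidden sizes. With mismatched sizes, some mapping
    between the two spaces (e.g., a learned projection) would be needed.
    """
    total, count = 0.0, 0
    for layer in layers:
        for s, t in zip(student_states[layer], teacher_states[layer]):
            total += (s - t) ** 2
            count += 1
    return total / count


rng = random.Random(0)
vocab_size = 1000  # toy size; the abstract cites roughly 150k for Qwen

student_probs = softmax([rng.gauss(0.0, 1.0) for _ in range(vocab_size)])
teacher_probs = softmax([rng.gauss(0.0, 1.0) for _ in range(vocab_size)])

# Output space: each single-sample estimate is noisy around the exact value.
kl_exact = exact_kl(student_probs, teacher_probs)
kl_samples = [mc_kl_estimate(student_probs, teacher_probs, rng) for _ in range(2000)]
kl_mc_mean = statistics.mean(kl_samples)
kl_mc_std = statistics.stdev(kl_samples)

# Hidden-state space: toy per-layer vectors for one token of one rollout.
hidden_size, num_layers = 16, 4
student_states = [[rng.gauss(0.0, 1.0) for _ in range(hidden_size)] for _ in range(num_layers)]
teacher_states = [[rng.gauss(0.0, 1.0) for _ in range(hidden_size)] for _ in range(num_layers)]
selected_layers = [1, 3]  # hypothetical choice of layers to align

# Evaluating the loss twice on the same rollout gives the same value.
loss_a = representation_loss(student_states, teacher_states, selected_layers)
loss_b = representation_loss(student_states, teacher_states, selected_layers)

print(f"exact KL: {kl_exact:.4f}")
print(f"MC KL mean over 2000 draws: {kl_mc_mean:.4f}, std: {kl_mc_std:.4f}")
print(f"representation loss (two evaluations): {loss_a:.4f}, {loss_b:.4f}")
```

In this sketch the single-sample KL estimates scatter around the exact value with nonzero standard deviation. The hidden-state loss is a fixed function of the rollout, so repeated evaluations agree exactly.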