OPRD: On-Policy Representation Distillation

Abstract

On-policy distillation (OPD) supervises the student only in output space, by matching next-token probabilities. This output-only paradigm has two limitations: (1) the sampling variance of Monte Carlo KL estimates over large vocabularies (e.g., Qwen's ~150k tokens) persists throughout training, and (2) the teacher is treated as a black box, with all of its intermediate hidden states discarded once the LM head has produced its output. We propose On-Policy Representation Distillation (OPRD), which lifts distillation into hidden-state space by aligning student and teacher representations across selected layers on the same rollouts, bypassing the LM head entirely. Theoretically, OPRD eliminates this sampling variance and provides richer, per-layer structural information. Empirically, OPRD closes the student-teacher gap on AIME 2024/2025 and AIMO, whereas output-space OPD baselines plateau below the teacher. OPRD also trains 1.44x faster and uses 54% less memory than top-k OPD. Code: https://github.com/ShenzhiYang2000/OPRD.
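
The contrast between the two supervision signals can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not specify the alignment loss, so the mean squared error below, the equal hidden widths of student and teacher, and every function name (`sampled_kl`, `representation_loss`, and so on) are assumptions made for illustration only. If the widths differed, some learned projection would presumably be needed, which is omitted here.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def exact_kl(p, q):
    """Exact KL(p || q): requires a full sum over the vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def sampled_kl(p, q, rng, n_samples=1):
    """Monte Carlo estimate of KL(p || q) from tokens sampled from p.

    Unbiased, but its value changes with the sampled tokens.
    """
    tokens = rng.choices(range(len(p)), weights=p, k=n_samples)
    return sum(math.log(p[t] / q[t]) for t in tokens) / n_samples


def representation_loss(student_layers, teacher_layers):
    """Assumed alignment loss: MSE per selected layer, averaged over layers.

    Each argument is a list of hidden-state vectors, one per selected layer.
    No token is sampled from the vocabulary, so the value is fully
    determined by the hidden states of the rollout.
    """
    per_layer = [
        sum((a - b) ** 2 for a, b in zip(h_s, h_t)) / len(h_s)
        for h_s, h_t in zip(student_layers, teacher_layers)
    ]
    return sum(per_layer) / len(per_layer)


rng = random.Random(0)
vocab = 1000  # toy size; the abstract mentions ~150k tokens for Qwen

# Toy next-token distributions for student (p) and teacher (q).
student_p = softmax([rng.gauss(0, 1) for _ in range(vocab)])
teacher_q = softmax([rng.gauss(0, 1) for _ in range(vocab)])

# Output-space signal: single-sample estimates fluctuate around the exact KL.
kl_estimates = [sampled_kl(student_p, teacher_q, rng) for _ in range(5)]
kl_exact = exact_kl(student_p, teacher_q)

# Hidden-state signal: 3 selected layers, hidden width 8 (both hypothetical).
student_h = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(3)]
teacher_h = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(3)]
loss_a = representation_loss(student_h, teacher_h)
loss_b = representation_loss(student_h, teacher_h)  # identical to loss_a
```

In this toy setting the single-sample KL estimates differ from draw to draw, while the hidden-state loss returns the same value every time it is evaluated on the same rollout. The rollouts themselves are still sampled from the student; what disappears is the extra sampling over the vocabulary.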