OPRD: On-Policy Representation Distillation

Abstract

On-policy distillation (OPD) supervises the student only in output space by matching next-token probabilities. This output-only paradigm has two limitations: (1) the sampling variance of Monte Carlo KL estimates over large vocabularies (e.g., Qwen's ~150k tokens) persists throughout training, and (2) the teacher is treated as a black box, so all intermediate hidden states that precede the LM head are discarded. We propose On-Policy Representation Distillation (OPRD), which lifts distillation into hidden-state space by aligning student and teacher representations at selected layers on the same rollouts, bypassing the LM head entirely. Theoretically, OPRD eliminates this sampling variance and provides richer per-layer structural information. Empirically, OPRD closes the student-teacher gap on AIME 2024/2025 and AIMO, while output-space OPD baselines plateau below the teacher. OPRD also trains 1.44x faster and uses 54% less memory than top-k OPD. Code: https://github.com/ShenzhiYang2000/OPRD.
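
To make the idea of aligning hidden states at selected layers concrete, the following is a minimal NumPy sketch. It is not the OPRD objective: the abstract does not specify the distance, the layer mapping, or how mismatched hidden sizes are handled. The normalized squared distance, the `layer_pairs` mapping, the optional projection matrices, and all function names below are illustrative assumptions. The sketch only shows that such a loss is a deterministic function of the hidden states on a rollout, with no sampling over the vocabulary.

```python
import numpy as np


def layer_alignment_loss(student_h, teacher_h, proj=None, eps=1e-8):
    """Squared distance between L2-normalized hidden states at one layer.

    student_h: array of shape (tokens, d_student)
    teacher_h: array of shape (tokens, d_teacher)
    proj:      optional (d_student, d_teacher) matrix (hypothetical) that maps
               student states into the teacher's hidden size.
    """
    if proj is not None:
        student_h = student_h @ proj
    # Normalize each token vector so the loss ignores scale differences.
    s = student_h / (np.linalg.norm(student_h, axis=-1, keepdims=True) + eps)
    t = teacher_h / (np.linalg.norm(teacher_h, axis=-1, keepdims=True) + eps)
    # Average the per-token squared distance over all rollout positions.
    return float(np.mean(np.sum((s - t) ** 2, axis=-1)))


def representation_distillation_loss(student_states, teacher_states,
                                     layer_pairs, projs=None):
    """Average the per-layer alignment loss over selected layer pairs.

    student_states / teacher_states: dicts mapping layer index -> hidden
        states computed on the SAME student-generated rollout.
    layer_pairs: list of (student_layer, teacher_layer) tuples (assumed mapping).
    projs: optional dict mapping student_layer -> projection matrix.
    """
    projs = projs or {}
    losses = [
        layer_alignment_loss(student_states[ls], teacher_states[lt],
                             projs.get(ls))
        for ls, lt in layer_pairs
    ]
    return sum(losses) / len(losses)
```

In a training loop, the teacher states would be computed without gradients on the tokens the student itself generated, and only the student (and any projection) would be updated. Because every token position contributes an exact term, repeated evaluation on the same rollout returns the same value, in contrast to a Monte Carlo KL estimate over sampled vocabulary entries.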