OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification

Abstract

On-Policy Distillation (OPD) trains a student model on its own generated trajectories under dense token-level feedback from a stronger teacher, mitigating both the off-policy distribution shift of Supervised Fine-Tuning (SFT) and the sparse credit assignment of Reinforcement Learning (RL). However, standard OPD faces two coupled limitations. First, it requires direct access to the teacher's token-level logits, which excludes a broad class of capable proprietary models from serving as teachers. Second, the token-level logit signal itself is brittle: it depends on a narrow overlap of plausible next tokens between teacher and student, and it is prone to amplifying degenerate patterns such as repetition loops. In this paper, we introduce OmniOPD, a novel framework that addresses both limitations through a logit-free, chunk-level supervision signal. OmniOPD replaces deterministic logit matching with Monte Carlo rollouts that approximate the teacher's local preferences through a continuous semantic similarity metric over multi-token chunks, and it concentrates this supervision via a peak-entropy scheduler that audits the student only at its high-uncertainty reasoning forks. A Dirichlet-Multinomial Bayesian prior and a base-model KL anchor further bound the variance of discrete sampling and prevent policy collapse across unaudited tokens. Across competitive benchmarks, OmniOPD surpasses the standard OPD approach by up to +28.64% on math, confirming that chunk-level semantic verification extracts a more reliable learning signal than token-level logit matching. Token-level logit matching has high information density, but that advantage is offset by the accompanying noise and brittleness. Furthermore, when paired with stronger black-box teachers such as Claude-4.5-Haiku and Gemini-2.5-Flash, OmniOPD achieves an additional +9.54% relative improvement on math over its open-weight teacher counterpart, advancing the student past the performance of self-exploratory RL.
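To make two of the ingredients above concrete, the following is a minimal sketch and not the paper's implementation. It assumes that "peak-entropy" scheduling means auditing the `k` positions where the student's next-token distribution has the highest Shannon entropy, and that the Dirichlet-Multinomial prior acts as symmetric pseudo-count smoothing over rollout outcome counts. The function names, the parameter `k`, and the concentration `alpha` are all hypothetical; the abstract does not specify any of them.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def peak_entropy_positions(entropies: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest-entropy positions, in sequence order.

    Assumption: these are the "reasoning forks" at which the teacher is queried.
    """
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:k])


def smoothed_preference(counts: Sequence[int], alpha: float = 1.0) -> List[float]:
    """Posterior mean of a Dirichlet-Multinomial with a symmetric prior.

    counts[i] is the (hypothetical) number of Monte Carlo rollouts that favored
    candidate chunk i; alpha is an assumed prior pseudo-count per candidate.
    """
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]
```

For example, given per-position student distributions, `peak_entropy_positions([token_entropy(d) for d in dists], k=2)` selects the two most uncertain positions, and `smoothed_preference([3, 0, 1])` keeps a candidate that no rollout favored at a small nonzero probability instead of zero. This is one simple way in which a prior can temper the variance of a small discrete sample.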