OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification

Abstract

On-Policy Distillation (OPD) trains a student model on its own generated trajectories under dense token-level feedback from a stronger teacher, mitigating both the off-policy distribution shift of Supervised Fine-Tuning (SFT) and the sparse credit assignment of Reinforcement Learning (RL). However, standard OPD faces two coupled limitations. First, it requires direct access to the teacher's token-level logits, which excludes a broad class of capable proprietary models from serving as teachers. Second, the token-level logit signal is itself brittle: it depends on a narrow overlap of plausible next tokens between teacher and student, and it is prone to amplifying degenerate patterns such as repetition loops. In this paper, we introduce OmniOPD, a novel framework that addresses both limitations through a logit-free, chunk-level supervision signal. OmniOPD replaces deterministic logit matching with Monte Carlo rollouts that approximate the teacher's local preferences through a continuous semantic similarity metric over multi-token chunks, and it concentrates this supervision via a peak-entropy scheduler that audits the student only at its high-uncertainty reasoning forks. A Dirichlet-Multinomial Bayesian prior and a base-model KL anchor further bound the variance of discrete sampling and prevent policy collapse across unaudited tokens. Across competitive benchmarks, OmniOPD surpasses the standard OPD approach by up to +28.64% on math, confirming that chunk-level semantic verification extracts a more reliable learning signal than token-level logit matching, whose high information density is offset by significant noise and brittleness. Furthermore, when paired with stronger black-box teachers such as Claude-4.5-Haiku and Gemini-2.5-Flash, OmniOPD achieves an additional +9.54% relative improvement on math over its open-weight teacher counterpart, advancing the student past the performance of self-exploratory RL.
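
As a rough illustration of two ingredients named in the abstract, the sketch below picks audit positions by peak next-token entropy and smooths a small number of discrete rollout outcomes with a symmetric Dirichlet prior. It is not the paper's implementation: the function names, the top-k selection rule, the use of preference counts over candidate chunks, and the value of `alpha` are all assumptions made for illustration only.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def peak_entropy_positions(step_probs, k):
    """Hypothetical scheduler: return the k highest-entropy positions.

    step_probs holds one student next-token distribution per position.
    The result is sorted by position so audits can run left to right.
    """
    entropies = [token_entropy(p) for p in step_probs]
    ranked = sorted(range(len(entropies)),
                    key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:k])


def dirichlet_posterior_mean(counts, alpha=1.0):
    """Posterior mean of a multinomial under a symmetric Dirichlet(alpha).

    counts[i] is assumed to be how often candidate chunk i was preferred
    across the Monte Carlo rollouts; the prior keeps estimates away from
    0 and 1 when only a handful of rollouts are available.
    """
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]


# Toy student distributions over a 4-token vocabulary at 4 positions.
student_probs = [
    [0.97, 0.01, 0.01, 0.01],  # confident step
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain step
    [0.70, 0.10, 0.10, 0.10],
    [0.40, 0.30, 0.20, 0.10],
]
audit_positions = peak_entropy_positions(student_probs, k=2)

# Four rollouts over three candidate chunks: preferred 3, 1, and 0 times.
smoothed = dirichlet_posterior_mean([3, 1, 0], alpha=1.0)
```

With these toy inputs the two audited positions are the uniform step and the flattest remaining step, and the candidate that was never preferred still receives a small nonzero weight (1/7) rather than exactly zero, which is the variance-limiting effect a Dirichlet prior is typically used for.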