OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification

Abstract

On-Policy Distillation (OPD) trains a student model on its own generated trajectories under dense token-level feedback from a stronger teacher, mitigating both the off-policy distribution shift of Supervised Fine-Tuning (SFT) and the sparse credit assignment of Reinforcement Learning (RL). However, standard OPD faces two coupled limitations. First, it requires direct access to the teacher's token-level logits, which excludes a broad class of capable proprietary models from serving as teachers. Second, the token-level logit signal is itself brittle: it depends on a narrow overlap of plausible next tokens between teacher and student, and it is prone to amplifying degenerate patterns such as repetition loops. We introduce OmniOPD, a framework that addresses both limitations through a logit-free, chunk-level supervision signal. OmniOPD replaces deterministic logit matching with Monte Carlo rollouts that approximate the teacher's local preferences through a continuous semantic similarity metric over multi-token chunks, and it concentrates this supervision via a peak-entropy scheduler that audits the student only at its high-uncertainty reasoning forks. A Dirichlet-Multinomial Bayesian prior and a base-model KL anchor further bound the variance of discrete sampling and prevent policy collapse on unaudited tokens. Across competitive benchmarks, OmniOPD surpasses standard OPD by up to +28.64% on math, indicating that chunk-level semantic verification extracts a more reliable learning signal than token-level logit matching, whose high information density is offset by significant noise and brittleness. Furthermore, when paired with stronger black-box teachers such as Claude-4.5-Haiku and Gemini-2.5-Flash, OmniOPD achieves an additional +9.54% relative improvement on math over its counterpart trained with an open-weight teacher, advancing the student past the performance of self-exploratory RL.
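
To make two of the ingredients named above more concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that "peak-entropy" scheduling means selecting the k decoding steps where the student's next-token distribution has the highest Shannon entropy, and that the Dirichlet-Multinomial prior acts as additive smoothing on counts of how often teacher rollouts were judged closest to each candidate chunk. The function names, the symmetric prior strength `alpha`, and the toy inputs are all hypothetical.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def peak_entropy_positions(step_probs, k):
    """Return the indices of the k highest-entropy steps, in trajectory order.

    Assumption: these stand in for the "high-uncertainty reasoning forks"
    at which the student would be audited.
    """
    entropies = [token_entropy(p) for p in step_probs]
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:k])


def dirichlet_multinomial_mean(counts, alpha=1.0):
    """Posterior mean of a categorical under a symmetric Dirichlet(alpha) prior.

    Assumption: counts[i] is how many Monte Carlo teacher rollouts were
    matched to candidate chunk i. The prior keeps every candidate's
    estimated preference strictly positive, which limits the variance
    introduced by a small number of discrete samples.
    """
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]


# Toy student distributions over a 4-token vocabulary at 4 decoding steps.
step_probs = [
    [0.97, 0.01, 0.01, 0.01],  # confident step
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain step
    [0.70, 0.10, 0.10, 0.10],
    [0.40, 0.30, 0.20, 0.10],
]
audit_steps = peak_entropy_positions(step_probs, k=2)
print(audit_steps)  # [1, 3]

# Three candidate chunks; 4 teacher rollouts matched them 3, 1 and 0 times.
print(dirichlet_multinomial_mean([3, 1, 0], alpha=1.0))
```

In this sketch the unaudited steps receive no teacher signal at all, which is why the abstract pairs the scheduler with a base-model KL anchor; that term is not modeled here.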