Dense Supervision, Sparse Updates: On the Sparsity and Geometry of On-Policy Distillation

Abstract

On-policy distillation (OPD) has recently become a prominent post-training recipe because it combines two desirable ingredients: on-policy student trajectories and dense teacher supervision. Yet how this hybrid changes a model's parameters remains unclear. Across several language and vision-language model pairs and use cases, our analysis yields two main findings. On sparsity, OPD-style updates are small in magnitude and coordinate-sparse; they are distributed across layers and are usually FFN-heavy. This sparse structure is operationally useful: training only the discovered subnetwork recovers nearly the same performance as full OPD. However, the sparsity-inducing SGD optimizer underperforms AdamW in our optimizer ablation, likely because dense teacher supervision preserves heterogeneous coordinate-wise gradient scales, for which AdamW's adaptive scaling remains useful. On geometry, the updates are numerically full-rank but spectrally concentrated; they lie mostly away from the principal singular subspaces of the source weights and fall disproportionately on coordinates where the source weights are close to zero. These findings suggest that dense teacher supervision does not turn OPD into ordinary dense parameter rewriting; instead, OPD retains important geometric signatures of on-policy post-training.
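
To make the quantities named in the abstract concrete, the following is a minimal NumPy sketch of how such statistics could be computed for a single weight matrix. It is an illustration under stated assumptions, not the measurement protocol used in this work: the function name `update_stats`, the change tolerance `tol`, the subspace size `k`, and the "close to zero" quantile `small_q` are all hypothetical choices.

```python
import numpy as np

def update_stats(w_src, w_new, tol=1e-5, k=4, small_q=0.25):
    """Summarize the update delta = w_new - w_src for one weight matrix.

    tol, k and small_q are illustrative assumptions, not values from the paper.
    """
    delta = w_new - w_src
    energy = float(np.sum(delta ** 2))
    if energy == 0.0:
        raise ValueError("update is exactly zero; nothing to analyze")

    # Coordinate sparsity: fraction of entries left (almost) untouched.
    changed = np.abs(delta) > tol
    sparsity = 1.0 - float(changed.mean())

    # Rank and spectral concentration of the update itself.
    s = np.linalg.svd(delta, compute_uv=False)
    numerical_rank = int(np.linalg.matrix_rank(delta))
    top_k_energy = float(np.sum(s[:k] ** 2) / np.sum(s ** 2))

    # Share of update energy inside the source's top-k singular subspaces.
    u, _, vt = np.linalg.svd(w_src, full_matrices=False)
    proj = u[:, :k].T @ delta @ vt[:k].T
    principal_energy = float(np.sum(proj ** 2) / energy)

    # Share of changed coordinates that sit on small-magnitude source weights.
    cutoff = np.quantile(np.abs(w_src), small_q)
    small = np.abs(w_src) <= cutoff
    small_share = float((changed & small).sum() / max(int(changed.sum()), 1))

    return {
        "sparsity": sparsity,
        "numerical_rank": numerical_rank,
        "top_k_energy": top_k_energy,
        "principal_energy": principal_energy,
        "small_share": small_share,
    }

# Synthetic demo: a random "source" matrix with a sparse, small perturbation.
rng = np.random.default_rng(0)
w_src = rng.normal(size=(16, 16))
mask = rng.random((16, 16)) < 0.05
w_new = w_src + 1e-2 * mask * rng.normal(size=(16, 16))
print(update_stats(w_src, w_new))
```

Under these assumed definitions, the pattern described in the abstract would correspond to high `sparsity`, a `numerical_rank` near the matrix dimension alongside high `top_k_energy`, low `principal_energy`, and a `small_share` above `small_q`. The synthetic demo only exercises the code and is not expected to reproduce that pattern.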