Dense Supervision, Sparse Updates: On the Sparsity and Geometry of On-Policy Distillation

Abstract

On-policy distillation (OPD) has recently become a prominent post-training recipe because it combines two desirable ingredients: on-policy student trajectories and dense teacher supervision. How this hybrid changes a model's parameters, however, remains unclear. Across several language and vision-language model pairs and use cases, our analysis yields two main findings. On sparsity, OPD-style updates are small in magnitude and coordinate-sparse. They are distributed across the layers of the network, and the feed-forward network (FFN) blocks usually account for the larger share. This sparse structure is operationally useful: training only the discovered subnetwork recovers nearly the same performance as full OPD. In our optimizer ablation, however, the sparsity-inducing SGD optimizer underperforms AdamW, likely because dense teacher supervision preserves heterogeneous coordinate-wise gradient scales, for which AdamW's adaptive scaling remains useful. On geometry, the updates are numerically full-rank but spectrally concentrated; they lie mostly away from the principal singular subspaces of the source weights and fall disproportionately on coordinates where the source weights are close to zero. These findings suggest that dense teacher supervision does not turn OPD into ordinary dense parameter rewriting; instead, OPD retains important geometric signatures of on-policy post-training.
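The quantities named in the abstract (coordinate sparsity, numerical rank, spectral concentration, overlap with the principal singular subspaces of the source weights, and placement on near-zero source coordinates) can be illustrated for a single weight matrix with a minimal NumPy sketch. This is not the paper's measurement protocol: the function name `update_diagnostics`, the change tolerance `tol`, the subspace size `k`, and the `small_quantile` cutoff are all hypothetical choices made for illustration.

```python
import numpy as np

def update_diagnostics(w_src, w_new, tol=1e-6, k=4, small_quantile=0.5):
    """Sparsity and geometry statistics for one weight-matrix update.

    All thresholds (tol, k, small_quantile) are illustrative assumptions,
    not values taken from the paper.
    """
    delta = w_new - w_src
    changed = np.abs(delta) > tol
    if not changed.any():
        raise ValueError("no coordinate changed by more than tol")

    # Coordinate sparsity: fraction of entries left numerically unchanged.
    sparsity = 1.0 - changed.mean()

    # Spectrum of the update: numerical rank and top-k energy share.
    s = np.linalg.svd(delta, compute_uv=False)
    rank_tol = s.max() * max(delta.shape) * np.finfo(delta.dtype).eps
    numerical_rank = int((s > rank_tol).sum())
    energy = s ** 2
    topk_energy = energy[:k].sum() / energy.sum()

    # Share of update energy inside the top-k singular subspaces of the
    # source weights (left and right), via projection of delta.
    u, _, vt = np.linalg.svd(w_src, full_matrices=False)
    uk, vk = u[:, :k], vt[:k, :].T
    projected = uk @ (uk.T @ delta @ vk) @ vk.T
    principal_energy = np.linalg.norm(projected) ** 2 / np.linalg.norm(delta) ** 2

    # Share of changed coordinates whose source weight is small in magnitude.
    threshold = np.quantile(np.abs(w_src), small_quantile)
    small = np.abs(w_src) <= threshold
    changed_on_small = (changed & small).sum() / changed.sum()

    return {
        "sparsity": float(sparsity),
        "numerical_rank": numerical_rank,
        "topk_energy": float(topk_energy),
        "principal_energy": float(principal_energy),
        "changed_on_small": float(changed_on_small),
    }

# Synthetic demonstration only: a random source matrix and a random sparse update.
rng = np.random.default_rng(0)
w_src = rng.standard_normal((64, 48))
mask = rng.random(w_src.shape) < 0.1
w_new = w_src + 1e-2 * rng.standard_normal(w_src.shape) * mask
for name, value in update_diagnostics(w_src, w_new).items():
    print(name, value)
```

For a random update such as the synthetic one above, `changed_on_small` should sit near `small_quantile`, so a value well above that baseline is what "falling disproportionately on near-zero source coordinates" would look like under this sketch. Likewise, a `principal_energy` close to zero corresponds to an update lying mostly away from the principal singular subspaces.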