Dense Supervision, Sparse Updates: On the Sparsity and Geometry of On-Policy Distillation

Abstract

On-policy distillation (OPD) has recently become a prominent post-training recipe because it combines two desirable ingredients: on-policy student trajectories and dense teacher supervision. Yet how this hybrid changes a model's parameters remains unclear. Across several language and vision-language model pairs and use cases, our analysis yields two main findings. On sparsity, OPD-style updates are small and coordinate-sparse; they are distributed across layers and are usually FFN-heavy, that is, concentrated in the feed-forward network (FFN) blocks. This sparse structure is operationally useful: training only the discovered subnetwork recovers nearly the same performance as full OPD. However, the sparsity-inducing SGD optimizer underperforms AdamW in our optimizer ablation, likely because dense teacher supervision preserves heterogeneous coordinate-wise gradient scales, a regime in which AdamW's adaptive scaling remains useful. On geometry, the updates are numerically full-rank but spectrally concentrated; they lie mostly away from the principal singular subspaces of the source weights and fall disproportionately on coordinates where the source weights are close to zero. These findings suggest that dense teacher supervision does not turn OPD into ordinary dense parameter rewriting; instead, OPD retains important geometric signatures of on-policy post-training.
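The quantities named in the abstract (coordinate sparsity, spectral concentration, overlap with the principal singular subspaces of the source weights, and update mass on near-zero source coordinates) can be made concrete for a single weight matrix. The following is a minimal NumPy sketch, not the paper's measurement protocol: the function name `update_diagnostics`, the tolerance `tol`, the subspace size `k`, and the quantile `small_quantile` that defines "close to zero" are all illustrative assumptions, and the matrices are synthetic.

```python
import numpy as np

def update_diagnostics(w_src, w_new, k=8, tol=1e-5, small_quantile=0.2):
    """Illustrative diagnostics for one weight update (all thresholds are assumptions)."""
    delta = w_new - w_src
    total = float((delta ** 2).sum())  # squared Frobenius norm of the update

    # Coordinate sparsity: fraction of entries that changed by at most `tol`.
    sparsity = float(np.mean(np.abs(delta) <= tol))

    # Spectrum of the update: numerical rank and share of energy in the top-k singular values.
    s = np.linalg.svd(delta, compute_uv=False)
    rank_tol = s.max() * max(delta.shape) * np.finfo(delta.dtype).eps
    numerical_rank = int((s > rank_tol).sum())
    top_k_energy = float((s[:k] ** 2).sum() / total) if total > 0 else 0.0

    # Overlap with the top-k left/right singular subspaces of the source weights.
    u, _, vt = np.linalg.svd(w_src, full_matrices=False)
    proj = u[:, :k].T @ delta @ vt[:k, :].T
    principal_overlap = float((proj ** 2).sum() / total) if total > 0 else 0.0

    # Share of update energy on coordinates where the source weight is small in magnitude.
    thresh = np.quantile(np.abs(w_src), small_quantile)
    small = np.abs(w_src) <= thresh
    mass_on_small = float((delta[small] ** 2).sum() / total) if total > 0 else 0.0

    return {
        "sparsity": sparsity,
        "numerical_rank": numerical_rank,
        "top_k_energy": top_k_energy,
        "principal_overlap": principal_overlap,
        "mass_on_small": mass_on_small,
    }

# Synthetic example: a random "source" matrix and a small update on about 5% of coordinates.
rng = np.random.default_rng(0)
w_src = rng.normal(size=(64, 48))
mask = rng.random(w_src.shape) < 0.05
w_new = w_src + mask * rng.normal(scale=1e-2, size=w_src.shape)

stats = update_diagnostics(w_src, w_new)
for name, value in stats.items():
    print(f"{name}: {value}")
```

On this synthetic update the mask is drawn independently of the source weights, so `mass_on_small` should sit near `small_quantile` and `principal_overlap` should be small. The abstract's geometric claim corresponds to real OPD updates showing a `mass_on_small` well above that baseline while `principal_overlap` stays low.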