Less is More: Early Stopping Rollout for On-Policy Distillation

Abstract

On-policy distillation (OPD) has recently emerged as a promising alternative to standard sequence-level imitation: it trains a student by scoring the student's own rollouts with a teacher model. However, we observe an "Off-policy Teacher Decay" problem in this paradigm. For later tokens, the student's earlier trajectory serves as context that is off-policy to the teacher, so the teacher's ability to produce a corrective score decays, and the teacher may fall back to token-completion behavior learned during pre-training. We empirically verify this problem and propose Early Stopping Rollout (ESR) to address it: a simple yet effective distillation strategy that restricts rollout generation to the first response tokens. We show that ESR surpasses full-rollout OPD across model sizes, model families, tasks, and training regimes, and that it exhibits much higher GPU efficiency and training stability, especially in cross-model-family scenarios. We further investigate the mechanism behind this surprising performance and identify the "Cascading Alignment" and "Sub-mode Commitment" effects of ESR, which may explain why it works effectively and sometimes even exceeds the teacher model's performance. In addition, we show that this position-based token selection strategy cannot be fully explained by KL divergence and entropy signals alone.
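
The difference between full-rollout distillation and an early-stopped rollout can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the `student` and `teacher` functions are hypothetical stand-ins for language models, the per-token reverse KL is only one possible way for a teacher to score a student rollout, and the cutoff of 4 tokens is arbitrary. The only point it shows is that the early-stopped variant samples, and has the teacher score, just the first few response tokens.

```python
import math
import random

VOCAB_SIZE = 5  # toy vocabulary; token ids are 0..4


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_model(context, offset):
    # Hypothetical stand-in for a language model: a deterministic
    # next-token distribution that depends on the context and on a
    # per-model offset (so student and teacher disagree).
    key = sum(context) + len(context)
    logits = [math.sin(offset + (i + 1) * (key + 1)) for i in range(VOCAB_SIZE)]
    return softmax(logits)


def student(context):
    return toy_model(context, offset=0.3)


def teacher(context):
    return toy_model(context, offset=0.9)


def sample_rollout(prompt, num_tokens, rng):
    # On-policy: every token is sampled from the student itself.
    context = list(prompt)
    for _ in range(num_tokens):
        token = rng.choices(range(VOCAB_SIZE), weights=student(context))[0]
        context.append(token)
    return context[len(prompt):]


def reverse_kl(p, q):
    # KL(p || q) for two distributions with strictly positive entries.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def distillation_loss(prompt, rollout):
    # Assumed scoring rule: at each position of the student's rollout,
    # compare the student and teacher next-token distributions.
    context = list(prompt)
    per_token = []
    for token in rollout:
        per_token.append(reverse_kl(student(context), teacher(context)))
        context.append(token)
    return sum(per_token) / len(per_token), per_token


prompt = [1, 2]
# Full rollout: the teacher scores all 16 student-generated tokens.
full_rollout = sample_rollout(prompt, 16, random.Random(0))
# Early-stopped rollout: generation ends after the first 4 tokens
# (an arbitrary cutoff chosen for this sketch).
esr_rollout = sample_rollout(prompt, 4, random.Random(0))

full_loss, _ = distillation_loss(prompt, full_rollout)
esr_loss, _ = distillation_loss(prompt, esr_rollout)
print(len(full_rollout), len(esr_rollout), full_loss, esr_loss)
```

With the same random seed, the early-stopped rollout is simply the prefix of the full rollout, so in this sketch the two variants differ only in how many student-generated positions are sampled and scored.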