Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

Abstract

On-policy distillation (OPD) has become a core technique in the post-training of large language models, yet its training dynamics remain poorly understood. This paper provides a systematic investigation of OPD dynamics and mechanisms. We first identify two conditions that govern whether OPD succeeds or fails: (i) the student and teacher should share compatible thinking patterns; and (ii) even when thinking patterns are consistent and the teacher scores higher, the teacher must offer genuinely new capabilities beyond what the student has seen during training. We validate these findings through weak-to-strong reverse distillation, showing that same-family 1.5B and 7B teachers are distributionally indistinguishable from the student's perspective. Probing the token-level mechanism, we show that successful OPD is characterized by progressive alignment on high-probability tokens at student-visited states, a small shared token set that concentrates most of the probability mass (97%-99%). We further propose two practical strategies to recover failing OPD: off-policy cold start and teacher-aligned prompt selection. Finally, we show that OPD's apparent free lunch of dense token-level reward comes at a cost, raising the question of whether OPD can scale to long-horizon distillation.
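To make the token-level quantities concrete, the following is a minimal sketch of how one might measure, at a single student-visited state, the set of high-probability tokens shared by student and teacher and the probability mass each model places on it. It rests on assumptions the abstract does not confirm: "high-probability tokens" is taken to mean each model's top-k tokens, and the per-token distillation signal is taken to be the reverse KL divergence KL(student || teacher), a common choice for on-policy distillation. The function names and the toy logits are hypothetical and serve only as illustration.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_ids(probs, k):
    """Return the indices of the k highest-probability tokens."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(ranked[:k])


def shared_top_k_mass(student_probs, teacher_probs, k):
    """Return the shared top-k token set and the mass each model puts on it."""
    shared = top_k_ids(student_probs, k) & top_k_ids(teacher_probs, k)
    student_mass = sum(student_probs[i] for i in shared)
    teacher_mass = sum(teacher_probs[i] for i in shared)
    return shared, student_mass, teacher_mass


def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher) at one state (assumed per-token OPD signal)."""
    return sum(
        s * math.log(s / t)
        for s, t in zip(student_probs, teacher_probs)
        if s > 0
    )


# Toy next-token logits over a 6-token vocabulary (made-up values).
student = softmax([4.0, 3.0, 0.0, -1.0, -2.0, -2.0])
teacher = softmax([3.5, 3.5, -1.0, 0.0, -2.0, -3.0])

shared, s_mass, t_mass = shared_top_k_mass(student, teacher, k=2)
print("shared tokens:", sorted(shared))
print("student mass on shared set:", round(s_mass, 3))
print("teacher mass on shared set:", round(t_mass, 3))
print("reverse KL:", round(reverse_kl(student, teacher), 4))
```

In a real setting the two distributions would come from the student and teacher models evaluated on the same student-generated prefix, and the statistics would be averaged over many states and tracked across training steps.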

