A Survey of On-Policy Distillation for Large Language Models

Abstract

Knowledge distillation has become a primary mechanism for transferring reasoning ability and domain expertise from frontier Large Language Models (LLMs) to smaller, deployable students. However, the dominant paradigm remains off-policy: students train on static teacher-generated data and never encounter their own errors during learning. This train-test mismatch, an instance of exposure bias, causes prediction errors to compound autoregressively at inference time. On-Policy Distillation (OPD) addresses this by letting the student generate its own trajectories and receive teacher feedback on these self-generated outputs, grounding distillation in the theory of interactive imitation learning. Despite rapid growth spanning divergence minimization, reward-guided learning, and self-play, the OPD literature remains fragmented and lacks a unified treatment. This survey provides the first comprehensive overview of OPD for LLMs. We introduce a unified f-divergence framework over on-policy samples and organize the landscape along three orthogonal dimensions: feedback signal (logit-based, outcome-based, or self-play), teacher access (white-box, black-box, or teacher-free), and loss granularity (token-level, sequence-level, or hybrid). We systematically analyze representative methods, examine industrial deployments, and identify open problems including distillation scaling laws, uncertainty-aware feedback, and agent-level distillation.
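As a rough illustration of the on-policy idea described in the abstract, the sketch below samples a trajectory from a toy student and scores each student-visited context against a toy teacher. It is not taken from the survey. It assumes a white-box, logit-based, token-level setting and picks reverse KL as one member of the f-divergence family. The bigram logit tables and the function name `on_policy_token_loss` are hypothetical.

```python
import math
import random


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher) for a single next-token distribution."""
    return sum(
        s * math.log(s / t)
        for s, t in zip(student_probs, teacher_probs)
        if s > 0
    )


def on_policy_token_loss(student_logits, teacher_logits, length, rng):
    """Average per-token reverse KL along a student-sampled trajectory.

    Assumption: both models are toy bigram tables, where row `prev`
    holds the next-token logits given the previous token id.
    """
    prev, total = 0, 0.0
    for _ in range(length):
        s = softmax(student_logits[prev])
        t = softmax(teacher_logits[prev])
        total += reverse_kl(s, t)  # teacher feedback on the student's own context
        # The next context comes from the student, not from teacher data.
        prev = rng.choices(range(len(s)), weights=s)[0]
    return total / length


rng = random.Random(0)
student = [[0.0, 0.0, 0.0] for _ in range(3)]                # uniform student
teacher = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]  # peaked teacher
print(on_policy_token_loss(student, teacher, length=16, rng=rng))
```

In an off-policy variant, the contexts would instead be read from a fixed teacher-generated sequence, so the student would never be scored on states reached through its own sampling.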
