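PACED: Distillation at the Frontier of Student Competence

Abstract

Standard LLM distillation wastes compute on two fronts: problems the student has already mastered (near-zero gradients) and problems far beyond its reach (incoherent gradients that erode existing capabilities). We show that this waste is not merely intuitive but structurally inevitable: the gradient signal-to-noise ratio (SNR) in distillation provably vanishes at both pass-rate extremes. This theoretical observation leads to Paced, a framework that concentrates distillation on the zone of proximal development, that is, the frontier of the student model's competence, via a principled pass-rate weight w(p) = p^α(1 - p)^β derived from the boundary-vanishing structure of distillation gradients. Key results: (1) Theory: We prove that the Beta kernel w(p) = p^α(1 - p)^β is a leading-order weight family arising from the SNR structure of distillation, and that it is minimax-robust: under bounded multiplicative misspecification, the worst-case efficiency loss is only O(δ^2). (2) Distillation: When distilling from a larger teacher to a smaller student model with forward KL, Paced achieves significant gains over the base model while keeping benchmark forgetting low. (3) Self-distillation: On instruction-tuned models with reverse KL, gains likewise exceed the baselines. (4) Two-stage synergy: A forward-KL-then-reverse-KL schedule yields the strongest results in our setting, with substantial improvements on standard reasoning benchmarks, which supports a mode-coverage-then-consolidation interpretation of the distillation process. All configurations require only student rollouts to estimate pass rates, need no architectural changes, and are compatible with any KL direction.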
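The pass-rate weight w(p) = p^α(1 - p)^β can be illustrated with a minimal Python sketch. It assumes each problem comes with a list of boolean student-rollout outcomes. The function names, the default exponents α = β = 1, and the normalization of weights across problems are all illustrative assumptions, not details given in the abstract.

```python
from typing import List, Sequence


def pass_rate(outcomes: Sequence[bool]) -> float:
    """Empirical pass rate: fraction of student rollouts judged correct."""
    if not outcomes:
        raise ValueError("need at least one rollout per problem")
    return sum(outcomes) / len(outcomes)


def beta_kernel_weight(p: float, alpha: float = 1.0, beta: float = 1.0) -> float:
    """Beta kernel w(p) = p^alpha * (1 - p)^beta.

    For alpha, beta > 0 the weight is zero at p = 0 and p = 1.
    The default exponents are placeholders (assumption), not tuned values.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("pass rate must lie in [0, 1]")
    return (p ** alpha) * ((1.0 - p) ** beta)


def problem_weights(
    rollouts: Sequence[Sequence[bool]],
    alpha: float = 1.0,
    beta: float = 1.0,
) -> List[float]:
    """Per-problem weights, normalized to sum to 1.

    Normalization is an illustrative choice (assumption); if every
    problem has weight zero, the raw zeros are returned unchanged.
    """
    raw = [beta_kernel_weight(pass_rate(r), alpha, beta) for r in rollouts]
    total = sum(raw)
    if total == 0.0:
        return raw
    return [w / total for w in raw]


if __name__ == "__main__":
    rollouts = [
        [True, True, True, True],      # already mastered: p = 1
        [True, False, True, False],    # at the frontier: p = 0.5
        [False, False, False, False],  # out of reach: p = 0
    ]
    print(problem_weights(rollouts))
```

In this sketch, problems the student always solves or never solves receive zero weight, so all of the weight falls on the problem with an intermediate pass rate. For positive exponents the kernel peaks at p = α/(α + β), so the choice of α and β shifts which pass rates are emphasized.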

