PACED: Distillation at the Frontier of Student Competence
March 11, 2026
Authors: Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang
cs.AI
Abstract
Standard LLM distillation wastes compute on two fronts: problems the student has already mastered (near-zero gradients) and problems far beyond its reach (incoherent gradients that erode existing capabilities). We show that this waste is not merely intuitive but structurally inevitable: the gradient signal-to-noise ratio in distillation provably vanishes at both pass-rate extremes. This theoretical observation leads to Paced, a framework that concentrates distillation on the zone of proximal development -- the frontier of a student model's competence -- via a principled pass-rate weight w(p) = p^α(1-p)^β derived from the boundary-vanishing structure of distillation gradients. Key results: (1) Theory: We prove that the Beta kernel w(p) = p^α(1-p)^β is a leading-order weight family arising from the SNR structure of distillation, and that it is minimax-robust -- under bounded multiplicative misspecification, the worst-case efficiency loss is only O(δ^2). (2) Distillation: When distilling from a larger teacher into a smaller student with forward KL, Paced achieves significant gains over the base model while keeping benchmark forgetting low. (3) Self-distillation: On instruction-tuned models with reverse KL, gains likewise exceed the baselines. (4) Two-stage synergy: A forward-KL-then-reverse-KL schedule yields the strongest results in our setting, reaching substantial improvements on standard reasoning benchmarks -- supporting a mode-coverage-then-consolidation interpretation of the distillation process. All configurations require only student rollouts to estimate pass rates, need no architectural changes, and are compatible with any KL direction.
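To make the weighting concrete, below is a minimal sketch of how the Beta-kernel pass-rate weight w(p) = p^α(1-p)^β could be used to reweight per-example distillation losses. The function names, the clamping constants, and the normalization are illustrative assumptions, not the paper's implementation; pass rates are assumed to be empirical estimates from student rollouts, as described in the abstract.

```python
import torch


def pass_rate_weight(p: torch.Tensor, alpha: float = 1.0, beta: float = 1.0) -> torch.Tensor:
    """Beta-kernel weight w(p) = p^alpha * (1 - p)^beta.

    Peaks at intermediate pass rates (the student's competence frontier) and
    vanishes at p = 0 and p = 1, where the distillation gradient SNR collapses.
    """
    p = p.clamp(1e-6, 1.0 - 1e-6)  # avoid 0^0 edge cases at the extremes (assumed safeguard)
    return p.pow(alpha) * (1.0 - p).pow(beta)


def weighted_distillation_loss(per_example_kl: torch.Tensor,
                               pass_rates: torch.Tensor,
                               alpha: float = 1.0,
                               beta: float = 1.0) -> torch.Tensor:
    """Reweight per-example KL terms (either KL direction) by the Beta kernel.

    `pass_rates` would be estimated from student rollouts, e.g. the fraction of
    k sampled completions per prompt judged correct.
    """
    w = pass_rate_weight(pass_rates, alpha, beta)
    return (w * per_example_kl).sum() / w.sum().clamp_min(1e-8)


# Example: prompts the student always or never solves receive ~zero weight;
# prompts near the frontier (p ≈ 0.5 when alpha = beta) dominate the update.
p = torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0])
kl = torch.tensor([0.8, 0.6, 0.5, 0.3, 0.1])
print(weighted_distillation_loss(kl, p))
```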