PACED: Distillation at the Frontier of Student Competence
March 11, 2026
Authors: Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang
cs.AI
Abstract
Standard LLM distillation wastes compute on two fronts: problems the student has already mastered (near-zero gradients) and problems far beyond its reach (incoherent gradients that erode existing capabilities). We show that this waste is not merely intuitive but structurally inevitable: the gradient signal-to-noise ratio in distillation provably vanishes at both pass-rate extremes. This theoretical observation leads to Paced, a framework that concentrates distillation on the zone of proximal development -- the frontier of a student model's competence -- via a principled pass-rate weight w(p) = p^α(1-p)^β derived from the boundary-vanishing structure of distillation gradients. Key results: (1) Theory: We prove that the Beta kernel w(p) = p^α(1-p)^β is a leading-order weight family arising from the SNR structure of distillation, and that it is minimax-robust -- under bounded multiplicative misspecification, the worst-case efficiency loss is only O(δ^2). (2) Distillation: When distilling from a larger teacher into a smaller student with forward KL, Paced achieves significant gains over the base model while keeping benchmark forgetting low. (3) Self-distillation: On instruction-tuned models with reverse KL, gains likewise exceed the baselines. (4) Two-stage synergy: A forward-KL-then-reverse-KL schedule yields the strongest results in our setting, reaching substantial improvements on standard reasoning benchmarks -- supporting a mode-coverage-then-consolidation interpretation of the distillation process. All configurations require only student rollouts to estimate pass rates, need no architectural changes, and are compatible with any KL direction.
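To make the weighting concrete, below is a minimal sketch of how the Beta-kernel pass-rate weight w(p) = p^α(1-p)^β could be used to reweight per-example distillation losses. The function names, the clamping constants, and the normalization are illustrative assumptions, not the paper's implementation; pass rates are assumed to be empirical estimates from student rollouts, as described in the abstract.

```python
import torch


def pass_rate_weight(p: torch.Tensor, alpha: float = 1.0, beta: float = 1.0) -> torch.Tensor:
    """Beta-kernel weight w(p) = p^alpha * (1 - p)^beta.

    Peaks at intermediate pass rates (the student's competence frontier) and
    vanishes at p = 0 and p = 1, where the distillation gradient SNR collapses.
    """
    p = p.clamp(1e-6, 1.0 - 1e-6)  # avoid 0^0 edge cases at the extremes (assumed safeguard)
    return p.pow(alpha) * (1.0 - p).pow(beta)


def weighted_distillation_loss(per_example_kl: torch.Tensor,
                               pass_rates: torch.Tensor,
                               alpha: float = 1.0,
                               beta: float = 1.0) -> torch.Tensor:
    """Reweight per-example KL terms (either KL direction) by the Beta kernel.

    `pass_rates` would be estimated from student rollouts, e.g. the fraction of
    k sampled completions per prompt judged correct.
    """
    w = pass_rate_weight(pass_rates, alpha, beta)
    return (w * per_example_kl).sum() / w.sum().clamp_min(1e-8)


# Example: prompts the student always or never solves receive ~zero weight;
# prompts near the frontier (p ≈ 0.5 when alpha = beta) dominate the update.
p = torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0])
kl = torch.tensor([0.8, 0.6, 0.5, 0.3, 0.1])
print(weighted_distillation_loss(kl, p))
```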