PACED: Distillation at the Frontier of Student Competence
March 11, 2026
Authors: Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang
cs.AI
Abstract
Standard LLM distillation wastes compute on two fronts: problems the student has already mastered (near-zero gradients) and problems far beyond its reach (incoherent gradients that erode existing capabilities). We show that this waste is not merely intuitive but structurally inevitable: the gradient signal-to-noise ratio in distillation provably vanishes at both pass-rate extremes. This theoretical observation leads to Paced, a framework that concentrates distillation on the zone of proximal development -- the frontier of a student model's competence -- via a principled pass-rate weight w(p) = p^α(1 - p)^β derived from the boundary-vanishing structure of distillation gradients. Key results: (1) Theory: We prove that the Beta kernel w(p) = p^α(1 - p)^β is a leading-order weight family arising from the SNR structure of distillation, and that it is minimax-robust -- under bounded multiplicative misspecification, the worst-case efficiency loss is only O(δ^2). (2) Distillation: When distilling from a larger teacher to a smaller student with forward KL, Paced achieves significant gains over the base model while keeping benchmark forgetting low. (3) Self-distillation: On instruction-tuned models with reverse KL, gains likewise exceed the baselines. (4) Two-stage synergy: A forward-KL-then-reverse-KL schedule yields the strongest results in our setting, reaching substantial improvements on standard reasoning benchmarks -- supporting a mode-coverage-then-consolidation interpretation of the distillation process. All configurations require only student rollouts to estimate pass rates, need no architectural changes, and are compatible with any KL direction.
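To make the weighting scheme concrete, here is a minimal Python sketch of the Beta-kernel pass-rate weight w(p) = p^α(1 - p)^β and of pass-rate estimation from student rollouts. The helper names `sample_solution` and `is_correct`, the rollout count, and the α, β defaults are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def pass_rate_weight(p, alpha=1.0, beta=1.0):
    """Beta-kernel weight w(p) = p**alpha * (1 - p)**beta.

    Vanishes at both pass-rate extremes, so mastered problems (p -> 1)
    and out-of-reach problems (p -> 0) contribute little, concentrating
    the distillation signal at the frontier of student competence.
    """
    p = np.clip(p, 0.0, 1.0)
    return p ** alpha * (1.0 - p) ** beta

def estimate_pass_rates(problems, sample_solution, is_correct, n_rollouts=8):
    """Estimate per-problem pass rates using only student rollouts.

    sample_solution(problem) -> one sampled student answer  (assumed helper)
    is_correct(problem, answer) -> bool                      (assumed helper)
    """
    rates = []
    for problem in problems:
        hits = sum(is_correct(problem, sample_solution(problem))
                   for _ in range(n_rollouts))
        rates.append(hits / n_rollouts)
    return np.array(rates)

# Hypothetical use: scale each problem's KL distillation loss by its weight.
# p = estimate_pass_rates(problems, sample_solution, is_correct)
# weighted_loss = (pass_rate_weight(p, alpha=1.0, beta=1.0) * per_problem_kl).mean()
```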