PACED: Distillation at the Frontier of Student Competence
March 11, 2026
Authors: Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang
cs.AI
Abstract
Standard LLM distillation wastes compute on two fronts: problems the student has already mastered (near-zero gradients) and problems far beyond its reach (incoherent gradients that erode existing capabilities). We show that this waste is not merely intuitive but structurally inevitable: the gradient signal-to-noise ratio in distillation provably vanishes at both pass-rate extremes. This theoretical observation leads to Paced, a framework that concentrates distillation on the zone of proximal development -- the frontier of a student model's competence -- via a principled pass-rate weight w(p) = p^α(1 - p)^β derived from the boundary-vanishing structure of distillation gradients. Key results: (1) Theory: We prove that the Beta kernel w(p) = p^α(1 - p)^β is a leading-order weight family arising from the SNR structure of distillation, and that it is minimax-robust -- under bounded multiplicative misspecification, the worst-case efficiency loss is only O(δ^2). (2) Distillation: When distilling from a larger teacher to a smaller student with forward KL, Paced achieves significant gains over the base model while keeping benchmark forgetting low. (3) Self-distillation: On instruction-tuned models with reverse KL, gains likewise exceed the baselines. (4) Two-stage synergy: A forward-KL-then-reverse-KL schedule yields the strongest results in our setting, reaching substantial improvements on standard reasoning benchmarks -- supporting a mode-coverage-then-consolidation interpretation of the distillation process. All configurations require only student rollouts to estimate pass rates, need no architectural changes, and are compatible with any KL direction.
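To make the weighting scheme concrete, here is a minimal Python sketch of the Beta-kernel pass-rate weight w(p) = p^α(1 - p)^β and of pass-rate estimation from student rollouts. The helper names `sample_solution` and `is_correct`, the rollout count, and the α, β defaults are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def pass_rate_weight(p, alpha=1.0, beta=1.0):
    """Beta-kernel weight w(p) = p**alpha * (1 - p)**beta.

    Vanishes at both pass-rate extremes, so mastered problems (p -> 1)
    and out-of-reach problems (p -> 0) contribute little, concentrating
    the distillation signal at the frontier of student competence.
    """
    p = np.clip(p, 0.0, 1.0)
    return p ** alpha * (1.0 - p) ** beta

def estimate_pass_rates(problems, sample_solution, is_correct, n_rollouts=8):
    """Estimate per-problem pass rates using only student rollouts.

    sample_solution(problem) -> one sampled student answer  (assumed helper)
    is_correct(problem, answer) -> bool                      (assumed helper)
    """
    rates = []
    for problem in problems:
        hits = sum(is_correct(problem, sample_solution(problem))
                   for _ in range(n_rollouts))
        rates.append(hits / n_rollouts)
    return np.array(rates)

# Hypothetical use: scale each problem's KL distillation loss by its weight.
# p = estimate_pass_rates(problems, sample_solution, is_correct)
# weighted_loss = (pass_rate_weight(p, alpha=1.0, beta=1.0) * per_problem_kl).mean()
```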