PACED: Distillation at the Frontier of Student Competence

Abstract

Standard LLM distillation wastes compute on two fronts: problems the student has already mastered (near-zero gradients) and problems far beyond its reach (incoherent gradients that erode existing capabilities). We show that this waste is not merely intuitive but structurally inevitable: the gradient signal-to-noise ratio (SNR) in distillation provably vanishes at both pass-rate extremes. This observation motivates Paced, a framework that concentrates distillation on the zone of proximal development, the frontier of the student model's competence, via a principled pass-rate weight w(p) = p^α(1 - p)^β derived from the boundary-vanishing structure of distillation gradients. Key results: (1) Theory: We prove that the Beta kernel w(p) = p^α(1 - p)^β is a leading-order weight family arising from the SNR structure of distillation, and that it is minimax-robust: under bounded multiplicative misspecification, the worst-case efficiency loss is only O(δ^2). (2) Distillation: When distilling from a larger teacher to a smaller student with forward KL, Paced achieves significant gains over the base model while keeping benchmark forgetting low. (3) Self-distillation: On instruction-tuned models with reverse KL, Paced likewise exceeds the baselines. (4) Two-stage synergy: A forward-KL-then-reverse-KL schedule yields the strongest results in our setting, with substantial improvements on standard reasoning benchmarks, which supports a mode-coverage-then-consolidation interpretation of the distillation process. All configurations require only student rollouts to estimate pass rates, need no architectural changes, and are compatible with any KL direction.
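The pass-rate weight described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `pass_rate` and `beta_kernel_weight`, the representation of rollouts as lists of booleans, and the choice α = β = 1 are all assumptions made for illustration, since the abstract does not specify them.

```python
from typing import Sequence


def pass_rate(outcomes: Sequence[bool]) -> float:
    """Empirical pass rate: fraction of student rollouts judged correct.

    `outcomes` is a hypothetical representation (one bool per rollout).
    """
    if not outcomes:
        raise ValueError("need at least one rollout to estimate a pass rate")
    return sum(outcomes) / len(outcomes)


def beta_kernel_weight(p: float, alpha: float = 1.0, beta: float = 1.0) -> float:
    """Beta-kernel weight w(p) = p^alpha * (1 - p)^beta.

    For alpha, beta > 0 the weight is zero at p = 0 and p = 1, so problems the
    student always fails or always solves contribute nothing.
    The default exponents are illustrative, not values from the paper.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("pass rate must lie in [0, 1]")
    if alpha <= 0 or beta <= 0:
        raise ValueError("alpha and beta must be positive")
    return (p ** alpha) * ((1.0 - p) ** beta)


# Three hypothetical problems with four student rollouts each.
rollouts = {
    "mastered": [True, True, True, True],        # p = 1.0
    "frontier": [True, False, True, False],      # p = 0.5
    "out_of_reach": [False, False, False, False] # p = 0.0
}

weights = {name: beta_kernel_weight(pass_rate(r)) for name, r in rollouts.items()}
```

With these assumed exponents, only the "frontier" problem receives a nonzero weight. More generally, for positive α and β the weight w(p) peaks at p = α / (α + β), so the ratio of the two exponents shifts the emphasis toward harder or easier problems.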
