Learning with Foresight: Revealing What Unlocks the Efficiency of On-Policy Distillation

Abstract

On-policy distillation (OPD) has emerged as an efficient post-training paradigm for large language models. However, existing studies largely attribute this advantage to denser and more stable supervision, while the parameter-level mechanisms underlying OPD's efficiency remain poorly understood. In this work, we argue that OPD's efficiency stems from a form of "foresight": early in training, it establishes a stable update trajectory toward the final model. This foresight manifests in two ways. First, at the Module-Allocation Level, OPD identifies regions with low marginal utility and concentrates updates on the modules that matter most for reasoning. Second, at the Update-Direction Level, OPD exhibits stronger low-rank concentration, and its dominant subspaces align closely with the final update subspace early in training. Building on these findings, we propose EffOPD, a plug-and-play acceleration method that speeds up OPD by adaptively selecting an extrapolation step size and moving along the current update direction. EffOPD requires no additional trainable modules or complex hyperparameter tuning, and achieves an average 3× training speedup while maintaining comparable final performance. Overall, our findings provide a parameter-dynamics perspective for understanding the efficiency of OPD and offer practical insights for designing more efficient post-training methods for large language models.
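To make the two Update-Direction notions and the extrapolation idea concrete, the following is a minimal NumPy sketch on a toy weight matrix. It is an illustration under stated assumptions, not the EffOPD implementation: the function names (`top_k_energy`, `subspace_overlap`, `extrapolate`, `pick_step`), the candidate step sizes, and the scoring function are all hypothetical, and the abstract does not specify how the step size is actually selected.

```python
import numpy as np

def top_k_energy(delta, k):
    """Low-rank concentration: share of squared Frobenius norm of an
    update matrix captured by its top-k singular values (assumed metric)."""
    s = np.linalg.svd(delta, compute_uv=False)
    return float(np.sum(s[:k] ** 2) / np.sum(s ** 2))

def subspace_overlap(delta_early, delta_final, k):
    """Alignment of the top-k left singular subspaces of two updates.
    Returns a value in [0, 1]; 1.0 means the subspaces coincide."""
    u_early = np.linalg.svd(delta_early, full_matrices=False)[0][:, :k]
    u_final = np.linalg.svd(delta_final, full_matrices=False)[0][:, :k]
    return float(np.linalg.norm(u_early.T @ u_final, "fro") ** 2 / k)

def extrapolate(w_init, w_now, alpha):
    """Move further along the current update direction (w_now - w_init)."""
    return w_now + alpha * (w_now - w_init)

def pick_step(w_init, w_now, score_fn, candidates=(0.0, 0.5, 1.0, 2.0, 4.0)):
    """Hypothetical adaptive choice: keep the candidate step size whose
    extrapolated weights score highest under a user-supplied score_fn."""
    return max(candidates, key=lambda a: score_fn(extrapolate(w_init, w_now, a)))

# Toy setup: the full update is rank-1, and the current checkpoint has
# covered half of it along the same direction.
rng = np.random.default_rng(0)
direction = rng.standard_normal((8, 1)) @ rng.standard_normal((1, 6))
w_init = rng.standard_normal((8, 6))
w_final = w_init + direction
w_now = w_init + 0.5 * direction

# Stand-in score: negative distance to the final weights. In practice the
# final model is unknown, so a real score would have to come from some
# held-out signal; that choice is an assumption outside the abstract.
score_fn = lambda w: -np.linalg.norm(w - w_final)

print("top-1 energy of early update:", top_k_energy(w_now - w_init, 1))
print("early/final subspace overlap:", subspace_overlap(w_now - w_init, w_final - w_init, 1))
print("selected step size:", pick_step(w_init, w_now, score_fn))
```

In this toy case the early update already lies in the final update subspace, so extrapolating along it reaches the final weights without further training steps. That is the intuition behind accelerating training when the early direction is reliable; real updates are only approximately low-rank and aligned, which is why the step size would need to be chosen adaptively.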