Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

Abstract

On-policy distillation (OPD) has emerged as an efficient post-training paradigm for large language models. However, standard OPD requires a live teacher inference server throughout training, resulting in substantial infrastructure overhead. In this work, we investigate whether on-policy distillation can be performed offline. A natural approach is to precompute teacher log-probabilities once over SFT rollouts and reuse them during training. In practice, however, this offline variant fails to reliably match the performance of standard OPD. To understand this discrepancy, we identify a previously overlooked condition that is critical for any OPD pipeline, which we term teacher consistency. This condition requires that the same teacher model be used for both supervised fine-tuning and OPD. We show that violating teacher consistency introduces an irreducible gradient bias, causing both offline and online OPD to converge to a suboptimal fixed point regardless of training duration. Building on this insight, we propose Lightning OPD, an offline on-policy distillation framework that enforces teacher consistency by precomputing teacher log-probabilities over SFT rollouts. This design eliminates the need for a live teacher server entirely. We further show that, under teacher consistency, Lightning OPD shares the same optimum as standard OPD, with bounded gradient discrepancy and an implicit regularization effect that helps prevent policy drift. Extensive experiments on mathematical reasoning and code generation demonstrate that Lightning OPD achieves state-of-the-art performance with significantly improved efficiency. Starting from an SFT-initialized Qwen3-8B-Base model, Lightning OPD reaches 69.9% on AIME 2024 in just 30 GPU hours, achieving a 4.0x speedup over standard OPD and substantially lowering the barrier to entry for academic research on LLM post-training.
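The precompute-and-reuse idea described in the abstract can be illustrated with a small sketch. The abstract does not state the training objective, so the code below assumes a common choice for on-policy distillation: a per-token log-probability gap between student and teacher on the sampled tokens, which acts as a single-sample estimate of the reverse KL. The names `score_rollouts`, `offline_opd_loss`, `toy_teacher`, and `toy_student` are hypothetical, and the toy models return fixed logits over a three-token vocabulary purely for illustration. This is not the paper's implementation.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical signature: maps a token prefix to next-token logits.
LogitsFn = Callable[[Sequence[int]], Sequence[float]]


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one logit vector."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def score_rollouts(rollouts: Sequence[Sequence[int]],
                   logits_fn: LogitsFn) -> List[List[float]]:
    """Log-probability that `logits_fn` assigns to each sampled token.

    Run once with the teacher, the result is a cache that can be stored
    and reused, so no teacher server is needed during training.
    """
    scored = []
    for tokens in rollouts:
        row = []
        for t, tok in enumerate(tokens):
            row.append(log_softmax(logits_fn(tokens[:t]))[tok])
        scored.append(row)
    return scored


def offline_opd_loss(student_logprobs: Sequence[float],
                     cached_teacher_logprobs: Sequence[float]) -> float:
    """Mean per-token gap log p_student - log p_teacher (assumed objective).

    The tokens come from fixed SFT rollouts rather than from the current
    student, which is the offline approximation discussed in the abstract.
    """
    if len(student_logprobs) != len(cached_teacher_logprobs):
        raise ValueError("student and teacher sequences must align")
    gaps = [s - t for s, t in zip(student_logprobs, cached_teacher_logprobs)]
    return sum(gaps) / len(gaps)


# Toy stand-ins for real models (assumptions for illustration only).
def toy_teacher(prefix: Sequence[int]) -> Sequence[float]:
    return [2.0, 0.0, 1.0]


def toy_student(prefix: Sequence[int]) -> Sequence[float]:
    return [1.0, 1.0, 1.0]


rollouts = [[0, 2, 1]]                                 # one SFT rollout
teacher_cache = score_rollouts(rollouts, toy_teacher)  # computed once, offline
student_lp = score_rollouts(rollouts, toy_student)[0]  # recomputed each step
loss = offline_opd_loss(student_lp, teacher_cache[0])
```

In a real pipeline the student scores would come from a differentiable forward pass, while `teacher_cache` would be loaded from disk alongside the rollouts. Under the teacher-consistency condition stated in the abstract, the model that fills this cache would be the same teacher that produced the SFT data.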
