

Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

April 14, 2026
作者: Yecheng Wu, Song Han, Hai Cai
cs.AI

Abstract
On-policy distillation (OPD) has emerged as an efficient post-training paradigm for large language models. However, standard OPD requires a live teacher inference server throughout training, resulting in substantial infrastructure overhead. In this work, we investigate whether on-policy distillation can be performed offline. A natural approach is to precompute teacher log-probabilities once over SFT rollouts and reuse them during training. In practice, however, this offline variant fails to reliably match the performance of standard OPD. To understand this discrepancy, we identify a previously overlooked condition that is critical for any OPD pipeline, which we term teacher consistency. This condition requires that the same teacher model be used for both supervised fine-tuning and OPD. We show that violating teacher consistency introduces an irreducible gradient bias, causing both offline and online OPD to converge to a suboptimal fixed point regardless of training duration. Building on this insight, we propose Lightning OPD, an offline on-policy distillation framework that enforces teacher consistency by precomputing teacher log-probabilities over SFT rollouts. This design eliminates the need for a live teacher server entirely. We further show that, under teacher consistency, Lightning OPD shares the same optimum as standard OPD, with bounded gradient discrepancy and an implicit regularization effect that helps prevent policy drift. Extensive experiments on mathematical reasoning and code generation demonstrate that Lightning OPD achieves state-of-the-art performance with significantly improved efficiency. Starting from an SFT-initialized Qwen3-8B-Base model, Lightning OPD reaches 69.9% on AIME 2024 in just 30 GPU hours, achieving a 4.0x speedup over standard OPD and substantially lowering the barrier to entry for academic research on LLM post-training.
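The core mechanism described above, precomputing teacher log-probabilities once over SFT rollouts and reusing them as the distillation signal, can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: the function names, the toy rollout, and the per-token reverse-KL surrogate (a common choice in on-policy distillation) are all assumptions for exposition.

```python
import numpy as np

def precompute_teacher_logprobs(teacher_logprob_fn, rollouts):
    """One-time offline pass: score each SFT rollout with the teacher and
    cache its per-token log-probabilities. After this pass, training never
    needs a live teacher server, which is the efficiency gain the abstract
    describes."""
    return {i: teacher_logprob_fn(tokens) for i, tokens in enumerate(rollouts)}

def per_token_reverse_kl(student_logprobs, teacher_logprobs):
    """Sampled-token reverse-KL surrogate: positive where the student
    assigns higher probability than the cached teacher. Its mean over a
    rollout serves as the distillation loss for that rollout."""
    s = np.asarray(student_logprobs, dtype=np.float64)
    t = np.asarray(teacher_logprobs, dtype=np.float64)
    return s - t

# Toy example with a hypothetical 4-token rollout and fake log-probs.
rollout = [17, 3, 42, 8]
fake_teacher = lambda toks: np.array([-0.1, -0.5, -0.2, -0.3])
cache = precompute_teacher_logprobs(fake_teacher, [rollout])

student_lp = np.array([-0.4, -0.5, -0.1, -0.6])
loss = per_token_reverse_kl(student_lp, cache[0]).mean()
```

Note that teacher consistency, as defined in the abstract, means `teacher_logprob_fn` must be the same model that generated the SFT supervision; swapping in a different teacher at this stage is exactly the violation the authors show introduces irreducible gradient bias.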