

Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

April 14, 2026
Authors: Yecheng Wu, Song Han, Hai Cai
cs.AI

Abstract

On-policy distillation (OPD) has emerged as an efficient post-training paradigm for large language models. However, standard OPD requires a live teacher inference server throughout training, resulting in substantial infrastructure overhead. In this work, we investigate whether on-policy distillation can be performed offline. A natural approach is to precompute teacher log-probabilities once over SFT rollouts and reuse them during training. In practice, however, this offline variant fails to reliably match the performance of standard OPD. To understand this discrepancy, we identify a previously overlooked condition that is critical for any OPD pipeline, which we term teacher consistency. This condition requires that the same teacher model be used for both supervised fine-tuning and OPD. We show that violating teacher consistency introduces an irreducible gradient bias, causing both offline and online OPD to converge to a suboptimal fixed point regardless of training duration. Building on this insight, we propose Lightning OPD, an offline on-policy distillation framework that enforces teacher consistency by precomputing teacher log-probabilities over SFT rollouts. This design eliminates the need for a live teacher server entirely. We further show that, under teacher consistency, Lightning OPD shares the same optimum as standard OPD, with bounded gradient discrepancy and an implicit regularization effect that helps prevent policy drift. Extensive experiments on mathematical reasoning and code generation demonstrate that Lightning OPD achieves state-of-the-art performance with significantly improved efficiency. Starting from an SFT-initialized Qwen3-8B-Base model, Lightning OPD reaches 69.9% on AIME 2024 in just 30 GPU hours, achieving a 4.0x speedup over standard OPD and substantially lowering the barrier to entry for academic research on LLM post-training.
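The core mechanism the abstract describes, precomputing teacher log-probabilities once over SFT rollouts and reusing them as a fixed distillation target, can be illustrated with a minimal sketch. This is not the paper's implementation; the function names, the NumPy setting, and the use of a per-token reverse-KL-style surrogate (mean of student minus teacher log-probability on the rollout tokens) are illustrative assumptions.

```python
import numpy as np

def precompute_teacher_logprobs(teacher_logits, token_ids):
    """Offline phase: score each rollout token under the teacher ONCE.

    teacher_logits: (T, V) array of teacher logits per position.
    token_ids: (T,) array of the tokens actually sampled in the rollout.
    Returns the (T,) log-probabilities log p_teacher(token_t | prefix),
    which can be cached to disk and reused for the whole training run.
    """
    # Numerically stable log-normalizer (log-sum-exp) per position.
    m = teacher_logits.max(axis=-1, keepdims=True)
    log_z = np.log(np.exp(teacher_logits - m).sum(axis=-1)) + m.squeeze(-1)
    return teacher_logits[np.arange(len(token_ids)), token_ids] - log_z

def opd_loss(student_logprobs, cached_teacher_logprobs):
    """Training phase: a reverse-KL-style per-token surrogate.

    Using only the cached teacher scores, no live teacher server is
    needed; the loss penalizes tokens the student rates more highly
    than the (consistent) teacher did.
    """
    return np.mean(student_logprobs - cached_teacher_logprobs)
```

When the student matches the cached teacher on every rollout token, the surrogate is zero; a positive value indicates the student over-weights tokens relative to the teacher. In the paper's framing, the key requirement is that the cached scores come from the *same* teacher used for SFT (teacher consistency), otherwise the gradient bias the authors identify persists no matter how long training runs.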