Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

Abstract

On-policy distillation (OPD) has emerged as an efficient post-training paradigm for large language models. However, standard OPD requires a live teacher inference server throughout training, resulting in substantial infrastructure overhead. In this work, we investigate whether on-policy distillation can be performed offline. A natural approach is to precompute teacher log-probabilities once over SFT rollouts and reuse them during training. In practice, however, this offline variant fails to reliably match the performance of standard OPD. To understand this discrepancy, we identify a previously overlooked condition that is critical for any OPD pipeline, which we term teacher consistency. This condition requires that the same teacher model be used for both supervised fine-tuning and OPD. We show that violating teacher consistency introduces an irreducible gradient bias, causing both offline and online OPD to converge to a suboptimal fixed point regardless of training duration. Building on this insight, we propose Lightning OPD, an offline on-policy distillation framework that enforces teacher consistency by precomputing teacher log-probabilities over SFT rollouts. This design eliminates the need for a live teacher server entirely. We further show that, under teacher consistency, Lightning OPD shares the same optimum as standard OPD, with bounded gradient discrepancy and an implicit regularization effect that helps prevent policy drift. Extensive experiments on mathematical reasoning and code generation demonstrate that Lightning OPD achieves state-of-the-art performance with significantly improved efficiency. Starting from an SFT-initialized Qwen3-8B-Base model, Lightning OPD reaches 69.9% on AIME 2024 in just 30 GPU hours, achieving a 4.0x speedup over standard OPD and substantially lowering the barrier to entry for academic research on LLM post-training.
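The precompute-and-reuse idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the training objective, so the per-token log-ratio used here (student log-probability minus cached teacher log-probability on each rollout token) is an assumed stand-in, and the names `precompute_teacher_logprobs`, `offline_distill_signal`, `toy_teacher`, and `toy_student` are hypothetical. The toy models return fixed logits over a three-token vocabulary only so that the example runs without any model weights.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def precompute_teacher_logprobs(rollouts, teacher_logits_fn):
    """Query the teacher once per token of each fixed rollout and cache
    log p_teacher(token | prefix). After this step the teacher is not needed."""
    cache = []
    for tokens in rollouts:
        cache.append([
            log_softmax(teacher_logits_fn(tokens[:t]))[tok]
            for t, tok in enumerate(tokens)
        ])
    return cache

def offline_distill_signal(rollouts, teacher_cache, student_logits_fn):
    """Mean over rollout tokens of (student log-prob - cached teacher log-prob).
    Assumed objective for illustration; only the student is evaluated here."""
    total, count = 0.0, 0
    for tokens, teacher_lp in zip(rollouts, teacher_cache):
        for t, tok in enumerate(tokens):
            student_lp = log_softmax(student_logits_fn(tokens[:t]))[tok]
            total += student_lp - teacher_lp[t]
            count += 1
    return total / count

# Hypothetical toy models: fixed logits regardless of the prefix.
teacher_calls = 0

def toy_teacher(prefix):
    global teacher_calls
    teacher_calls += 1  # count teacher queries to show they stop after precompute
    return [2.0, 0.0, -1.0]

def toy_student(prefix):
    return [0.0, 0.0, 0.0]

# Stand-in for SFT rollouts: two short token-id sequences.
rollouts = [[0, 1, 0], [2, 0]]

teacher_cache = precompute_teacher_logprobs(rollouts, toy_teacher)
calls_after_precompute = teacher_calls

# Training-time computation reuses the cache and never calls the teacher.
signal = offline_distill_signal(rollouts, teacher_cache, toy_student)
```

In this sketch the teacher is queried exactly once per rollout token during the precompute pass, and every later evaluation of the signal touches only the student and the cache, which is the property that removes the live teacher server. The signal is zero when the student assigns the same log-probabilities as the cached teacher values.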
