Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

Abstract

On-policy distillation (OPD) has emerged as an efficient post-training paradigm for large language models. However, standard OPD requires a live teacher inference server throughout training, which adds substantial infrastructure overhead. In this work, we investigate whether on-policy distillation can be performed offline. A natural approach is to precompute teacher log-probabilities once over SFT rollouts and reuse them during training. In practice, however, this offline variant fails to reliably match the performance of standard OPD. To understand this discrepancy, we identify a previously overlooked condition that is critical for any OPD pipeline, which we term teacher consistency. This condition requires that the same teacher model be used for both supervised fine-tuning (SFT) and OPD. We show that violating teacher consistency introduces an irreducible gradient bias, causing both offline and online OPD to converge to a suboptimal fixed point regardless of training duration. Building on this insight, we propose Lightning OPD, an offline on-policy distillation framework that enforces teacher consistency by precomputing teacher log-probabilities over SFT rollouts. This design eliminates the need for a live teacher server entirely. We further show that, under teacher consistency, Lightning OPD shares the same optimum as standard OPD, with bounded gradient discrepancy and an implicit regularization effect that helps prevent policy drift. Extensive experiments on mathematical reasoning and code generation demonstrate that Lightning OPD achieves state-of-the-art performance with significantly improved efficiency. Starting from an SFT-initialized Qwen3-8B-Base model, Lightning OPD reaches 69.9% on AIME 2024 in just 30 GPU hours, a 4.0x speedup over standard OPD, substantially lowering the barrier to entry for academic research on LLM post-training.
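
The precompute-and-reuse idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the training objective, so the code assumes a per-token log-ratio between student and cached teacher log-probabilities purely for illustration. The toy models, the function names (`precompute_teacher_logprobs`, `offline_distill_loss`), and the three-token vocabulary are all hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def precompute_teacher_logprobs(teacher_logits_fn, rollouts):
    """Run the teacher once over fixed rollouts and cache per-token log-probs."""
    cache = []
    for tokens in rollouts:
        row = []
        for t, tok in enumerate(tokens):
            logp = log_softmax(teacher_logits_fn(tokens[:t]))
            row.append(logp[tok])
        cache.append(row)
    return cache


def offline_distill_loss(student_logits_fn, rollouts, teacher_cache):
    """Mean per-token log-ratio (student minus teacher), read from the cache.

    Assumed objective for illustration only; the teacher is never called here.
    """
    total, count = 0.0, 0
    for tokens, cached in zip(rollouts, teacher_cache):
        for t, tok in enumerate(tokens):
            student_logp = log_softmax(student_logits_fn(tokens[:t]))[tok]
            total += student_logp - cached[t]
            count += 1
    return total / count


# Hypothetical toy models over a 3-token vocabulary; both ignore the prefix.
def toy_teacher(prefix):
    return [2.0, 0.5, -1.0]


def toy_student(prefix):
    return [1.0, 1.0, 0.0]


rollouts = [[0, 1, 0], [2, 0]]  # stand-ins for SFT rollouts (token ids)
teacher_cache = precompute_teacher_logprobs(toy_teacher, rollouts)  # one-time pass
loss = offline_distill_loss(toy_student, rollouts, teacher_cache)   # no teacher call
```

In this sketch the teacher is queried only inside `precompute_teacher_logprobs`; every later training step reads from `teacher_cache`, which is the property that removes the need for a live teacher server. Under the teacher consistency condition, the model that fills the cache would be the same teacher used for SFT.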
