AsyncOPD: How Stale Can On-Policy Distillation Be?

Abstract

On-policy distillation (OPD) trains a student on its own rollouts guided by teacher feedback and is becoming increasingly important for large language model (LLM) post-training. Like reinforcement learning (RL), however, OPD faces an on-policy systems bottleneck, as rollouts can dominate training time for reasoning workloads. Asynchronous training pipelines can alleviate this bottleneck by decoupling rollout generation from learner updates, but doing so introduces stale-policy data. While prior work has studied stale data in asynchronous RL, its effects in OPD remain underexplored. We present the first systematic study of staleness in asynchronous OPD, focusing on a practical setting where teacher feedback is implemented through local KL losses and full-vocabulary teacher logits are too expensive to store or transfer, necessitating finite teacher-score caches. We first show that KL direction changes the stale-data problem: teacher-weighted forward KL is more robust to stale rollouts, whereas student-weighted reverse KL is vulnerable. Second, for this vulnerable reverse-KL case, we study whether methods designed to stabilize asynchronous RL can mitigate OPD staleness. In our experiments, they do not improve over a simpler OPD-specific surrogate: recomputing the reverse-KL signal under the current student at learner time. Third, we analyze how finite teacher-score caches create a bias-variance tradeoff for sparse and sampled reverse-KL OPD estimators. This motivates multi-sample Monte Carlo (MC), which preserves MC correctability while reducing one-sample variance. Finally, we present and open-source AsyncOPD, a fully asynchronous OPD training pipeline built from these estimator choices. Experiments show that AsyncOPD improves training throughput by 1.6× to 3.8× over strict synchronous training while reaching comparable accuracy.
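To make the two KL directions and the multi-sample MC idea concrete, the following is a minimal sketch on a toy four-token vocabulary. It is not the AsyncOPD implementation: the function names, the toy distributions, and the sample counts are all assumptions chosen for illustration, and the sparse, cached, and stale-data-corrected estimators studied in the paper are not reproduced here. The sketch only shows that forward KL weights each token by the teacher, reverse KL weights each token by the student, and that averaging several student-sampled tokens estimates the same reverse KL as a single sample but with lower variance.

```python
import math
import random
import statistics


def forward_kl(teacher, student):
    """KL(teacher || student): each token's log-ratio is weighted by the teacher."""
    return sum(p * math.log(p / q) for p, q in zip(teacher, student) if p > 0)


def reverse_kl(teacher, student):
    """KL(student || teacher): each token's log-ratio is weighted by the student."""
    return sum(q * math.log(q / p) for p, q in zip(teacher, student) if q > 0)


def mc_reverse_kl(teacher, student, num_samples, rng):
    """Monte Carlo estimate of KL(student || teacher) from student-sampled tokens.

    Only the teacher probabilities of the sampled tokens are read, loosely
    mimicking a small teacher-score cache instead of full-vocabulary logits.
    Assumes every teacher probability is strictly positive.
    """
    tokens = rng.choices(range(len(student)), weights=student, k=num_samples)
    return sum(math.log(student[t] / teacher[t]) for t in tokens) / num_samples


# Hypothetical next-token distributions over a 4-token vocabulary.
teacher = [0.70, 0.20, 0.05, 0.05]
student = [0.40, 0.30, 0.20, 0.10]

exact = reverse_kl(teacher, student)

# Repeat each estimator many times to compare its mean and spread.
rng = random.Random(0)
one_sample = [mc_reverse_kl(teacher, student, 1, rng) for _ in range(2000)]
eight_sample = [mc_reverse_kl(teacher, student, 8, rng) for _ in range(2000)]

print("forward KL:", forward_kl(teacher, student))
print("reverse KL (exact):", exact)
print("1-sample MC mean / variance:",
      statistics.mean(one_sample), statistics.pvariance(one_sample))
print("8-sample MC mean / variance:",
      statistics.mean(eight_sample), statistics.pvariance(eight_sample))
```

In this toy setting both MC estimators are unbiased for the exact reverse KL, and the eight-sample version trades extra teacher-score lookups for a smaller variance. Because the sampling distribution in `mc_reverse_kl` is the student itself, tokens drawn from an older student would be weighted by the wrong distribution, which is one way to read the abstract's claim that student-weighted reverse KL is the direction vulnerable to stale rollouts.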