Accelerating RL Post-Training Rollouts via System-Integrated Speculative Decoding

Abstract

RL post-training of frontier language models is increasingly bottlenecked by autoregressive rollout generation, making rollout acceleration a central systems challenge. Many existing efficiency methods improve throughput by changing the rollout or optimization regime, for example through off-policy execution, replay, or lower-precision generation. We study speculative decoding as a lossless acceleration primitive for RL rollouts, one that preserves the target model's output distribution. We implement speculative decoding in NeMo-RL with a vLLM backend, supporting both synchronous and asynchronous pipelines and enabling speculation during RL rollouts. The benefit is realizable across speculation mechanisms, including pretrained MTP heads, small external draft models, and techniques such as Eagle3 that are traditionally applied only after the RL phase. This yields a deployment path for state-of-the-art speculative decoding inside RL training. In a reasoning post-training workload at 8B scale under synchronous RL, speculative decoding improves rollout throughput by 1.8x. Using a high-fidelity performance simulator, we project that combining speculative decoding with asynchronous RL yields up to a 2.5x end-to-end training speedup at 235B scale.
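The "lossless" property above can be illustrated with the standard speculative sampling accept/reject rule. The sketch below is not the NeMo-RL or vLLM implementation; it is a minimal single-token illustration over a toy vocabulary, and the function names `speculative_sample` and `empirical_distribution` are hypothetical. It assumes the draft proposes a token from its distribution q, the target accepts it with probability min(1, p/q), and a rejected proposal is replaced by a sample from the normalized residual max(p - q, 0).

```python
import random


def speculative_sample(p_target, q_draft, rng):
    """Return one token index distributed according to p_target.

    p_target and q_draft are probability lists over the same toy vocabulary
    (hypothetical inputs for illustration only).
    """
    vocab = range(len(p_target))
    # The draft model proposes a token x ~ q.
    x = rng.choices(vocab, weights=q_draft)[0]
    # The target model accepts the proposal with probability min(1, p(x) / q(x)).
    if rng.random() < min(1.0, p_target[x] / q_draft[x]):
        return x
    # On rejection, resample from the residual distribution max(p - q, 0).
    # rng.choices normalizes the weights, so no explicit division is needed.
    residual = [max(p - q, 0.0) for p, q in zip(p_target, q_draft)]
    return rng.choices(vocab, weights=residual)[0]


def empirical_distribution(p_target, q_draft, n, seed=0):
    """Estimate the output distribution of speculative_sample from n draws."""
    rng = random.Random(seed)
    counts = [0] * len(p_target)
    for _ in range(n):
        counts[speculative_sample(p_target, q_draft, rng)] += 1
    return [c / n for c in counts]


if __name__ == "__main__":
    p = [0.6, 0.3, 0.1]  # toy target distribution (assumed values)
    q = [0.2, 0.5, 0.3]  # toy draft distribution (assumed values)
    print(empirical_distribution(p, q, 100_000))
```

The empirical frequencies track the target distribution p even though every proposal comes from the mismatched draft q. This is the sense in which speculation changes the cost of generating rollouts but not the policy the samples are drawn from.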
