EfficientRollout: System-Aware Self-Speculative Decoding for Reinforcement Learning Rollouts

Abstract

Reinforcement learning (RL) has become a representative post-training paradigm for LLMs, enabling strong reasoning and agentic capabilities. However, rollout generation remains a dominant latency bottleneck because autoregressive (AR) sampling decodes responses sequentially and a small number of long-tailed generations often determine completion time. Speculative decoding (SD) offers a natural way to address this bottleneck: it is a well-established technique for serving fixed LLMs that reduces latency by rapidly drafting tokens and accepting them through parallel verification while preserving the target-model distribution. However, its practical speedups do not directly carry over to RL rollouts: (i) the evolving target policy makes any fixed drafter increasingly mismatched with the policy's output distribution; and (ii) active batch sizes shrink throughout rollout decoding, shifting decoding from compute-bound to memory-bound regimes, and only in the latter can parallel verification exploit underutilized compute. Therefore, accelerating RL rollouts requires both a drafter that remains effective under long, high-temperature generations from an evolving policy and system-aware use of SD that avoids compute-bound regimes. We present EfficientRollout, a system-aware self-SD framework designed to address this gap for RL rollouts. EfficientRollout induces a quantized drafter from the target model (i.e., self-speculative decoding), keeping it coupled to the evolving policy without separate drafter pretraining or online adaptation. It further coordinates a system-aware SD toggle policy with acceptance-aware draft-length adaptation, enabling speculation only in beneficial regimes while matching the drafting budget to evolving drafter quality. EfficientRollout reduces rollout latency by up to 19.6% and end-to-end latency by up to 12.7% over an accelerated AR rollout baseline, while preserving final model quality.
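
To make the mechanisms named in the abstract concrete, the following is a minimal Python sketch under stated assumptions. `verify_draft` implements the standard speculative-sampling acceptance test, which is what lets SD preserve the target-model distribution (the bonus token normally sampled after a fully accepted draft is omitted for brevity). `should_speculate` and `next_draft_length` are hypothetical stand-ins for a system-aware toggle and acceptance-aware draft-length adaptation; their names, thresholds, and update rules are illustrative assumptions and are not taken from EfficientRollout.

```python
import random


def verify_draft(draft_tokens, q_probs, p_probs, rng=random):
    """Standard speculative-sampling acceptance test.

    draft_tokens: token ids proposed by the drafter.
    q_probs[i]:   drafter distribution at position i (dict token -> prob).
    p_probs[i]:   target distribution at position i (dict token -> prob).
    Returns the accepted prefix plus, on rejection, one resampled token.
    """
    out = []
    for tok, q, p in zip(draft_tokens, q_probs, p_probs):
        # Accept the drafted token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p.get(tok, 0.0) / q[tok]):
            out.append(tok)
            continue
        # On rejection, resample from the residual max(0, p - q).
        residual = {t: max(0.0, p[t] - q.get(t, 0.0)) for t in p}
        tokens, weights = zip(*residual.items())
        out.append(rng.choices(tokens, weights=weights)[0])
        break
    return out


def should_speculate(active_batch, max_batch_for_sd=8):
    """Hypothetical toggle: speculate only once the active batch is small,
    i.e. when decoding is assumed to be memory-bound. The threshold is an
    illustrative assumption, not a value from the paper."""
    return active_batch <= max_batch_for_sd


def next_draft_length(current_len, accepted, lo=1, hi=8):
    """Hypothetical acceptance-aware rule: grow the draft after a fully
    accepted draft, shrink it after any rejection."""
    if accepted >= current_len:
        return min(hi, current_len + 1)
    return max(lo, current_len - 1)
```

In a rollout loop, a controller of this shape would be consulted once per decoding step: speculation stays off while many sequences are still active, then switches on for the long-tail generations, with the draft length tracking how many drafted tokens the target recently accepted.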