EfficientRollout: System-Aware Self-Speculative Decoding for Reinforcement Learning Rollouts

Abstract

Reinforcement learning (RL) has become a representative post-training paradigm for large language models (LLMs), enabling strong reasoning and agentic capabilities. However, rollout generation remains a dominant latency bottleneck: autoregressive (AR) sampling decodes each response sequentially, and a small number of long-tailed generations often determine completion time. Speculative decoding (SD) is a natural way to address this bottleneck. It is a well-established technique for serving fixed LLMs that reduces latency by rapidly drafting tokens and accepting them through parallel verification, while preserving the target-model distribution. Its practical speedups, however, do not carry over directly to RL rollouts, for two reasons: (i) the evolving target policy makes any fixed drafter increasingly mismatched with the policy's output distribution; and (ii) active batch sizes shrink throughout rollout decoding, shifting decoding from a compute-bound regime to a memory-bound one, and only in the latter can parallel verification exploit underutilized compute. Accelerating RL rollouts therefore requires both a drafter that remains effective under long, high-temperature generations from an evolving policy and a system-aware use of SD that avoids compute-bound regimes. We present EfficientRollout, a system-aware self-speculative decoding framework for RL rollouts. EfficientRollout induces a quantized drafter from the target model itself (i.e., self-speculative decoding), which keeps the drafter coupled to the evolving policy without separate drafter pretraining or online adaptation. It further coordinates a system-aware SD toggle policy with acceptance-aware draft-length adaptation, enabling speculation only in beneficial regimes while matching the drafting budget to the evolving drafter quality. Compared with an accelerated AR rollout baseline, EfficientRollout reduces rollout latency by up to 19.6% and end-to-end latency by up to 12.7%, while preserving final model quality.
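
To make the interplay between the SD toggle and draft-length adaptation concrete, the following is a minimal sketch and not EfficientRollout's actual policy. It assumes the standard simplified cost model for speculative decoding: each drafted token is accepted independently with probability `alpha`, one draft step costs `draft_cost` target steps, and verifying a whole draft costs one target step (reasonable only when decoding is memory-bound). The class `SpecController`, its fields, and all threshold values are hypothetical names and numbers chosen for illustration.

```python
from dataclasses import dataclass


def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per verification step for draft length gamma,
    assuming each drafted token is accepted independently with probability alpha."""
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


@dataclass
class SpecController:
    # All defaults are illustrative assumptions, not values from the paper.
    max_active_batch: int = 32  # above this, treat decoding as compute-bound
    draft_cost: float = 0.3     # cost of one draft step relative to one target step
    max_gamma: int = 8          # largest draft length considered
    ema: float = 0.9            # smoothing factor for the acceptance estimate
    alpha: float = 0.7          # running acceptance-rate estimate (initial guess)

    def update(self, accepted: int, drafted: int) -> None:
        """Fold the latest acceptance statistics into the running estimate."""
        if drafted > 0:
            self.alpha = self.ema * self.alpha + (1 - self.ema) * (accepted / drafted)

    def speedup(self, gamma: int) -> float:
        """Estimated speedup over plain autoregressive decoding:
        tokens per step divided by step cost (gamma draft steps + 1 verification)."""
        return expected_tokens_per_step(self.alpha, gamma) / (gamma * self.draft_cost + 1.0)

    def choose_gamma(self, active_batch: int) -> int:
        """Return the draft length to use; 0 means speculation is switched off."""
        if active_batch > self.max_active_batch:
            return 0  # compute-bound regime: skip speculation
        best = max(range(1, self.max_gamma + 1), key=self.speedup)
        return best if self.speedup(best) > 1.0 else 0


if __name__ == "__main__":
    ctrl = SpecController()
    for batch in (256, 64, 16, 4):
        print(f"active batch {batch:>3}: draft length {ctrl.choose_gamma(batch)}")
```

Under these assumptions the controller disables speculation while many sequences are still active, switches it on once the batch has shrunk, and shortens or abandons drafting as the measured acceptance rate drops. A real system would replace the fixed batch threshold with a measured or profiled estimate of when verification stops being effectively free.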