EfficientRollout: System-Aware Self-Speculative Decoding for RL Rollouts

Abstract

Reinforcement learning (RL) has become a representative post-training paradigm for large language models (LLMs), enabling strong reasoning and agentic capabilities. However, rollout generation remains a dominant latency bottleneck: autoregressive (AR) sampling decodes each response sequentially, and a small number of long-tailed generations often determine completion time. Speculative decoding (SD) is a natural candidate for addressing this bottleneck. It is a well-established technique for serving fixed LLMs that reduces latency by rapidly drafting tokens and accepting them through parallel verification, while preserving the target-model distribution. However, its practical speedups do not directly carry over to RL rollouts: (i) the evolving target policy makes any fixed drafter increasingly mismatched with the policy's output distribution; and (ii) active batch sizes shrink throughout rollout decoding, shifting decoding from a compute-bound regime to a memory-bound one, and only in the latter can parallel verification exploit otherwise underutilized compute. Therefore, accelerating RL rollouts requires both a drafter that remains effective under long, high-temperature generations from an evolving policy and a system-aware use of SD that avoids compute-bound regimes. We present EfficientRollout, a system-aware self-speculative decoding (self-SD) framework designed to close this gap. EfficientRollout induces a quantized drafter from the target model itself (i.e., self-speculative decoding), keeping the drafter coupled to the evolving policy without separate drafter pretraining or online adaptation. It further coordinates a system-aware SD toggle policy with acceptance-aware draft-length adaptation, enabling speculation only in beneficial regimes while matching the drafting budget to the evolving drafter quality. Compared with an accelerated AR rollout baseline, EfficientRollout reduces rollout latency by up to 19.6% and end-to-end latency by up to 12.7%, while preserving final model quality.
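To make the two coordinated mechanisms concrete, the following is a minimal sketch of how an SD toggle and acceptance-aware draft-length adaptation could interact. It is not the EfficientRollout implementation: the class name `SpecController`, the batch-size threshold, the exponential moving average, and the grow/shrink rule are all illustrative assumptions, since the abstract does not specify how either policy is computed.

```python
from dataclasses import dataclass


@dataclass
class SpecController:
    """Hypothetical controller; every threshold here is an assumed value."""

    max_spec_batch: int = 32    # assumed: speculate only at or below this active batch size
    min_len: int = 1            # lower bound on the draft length
    max_len: int = 8            # upper bound on the draft length
    draft_len: int = 4          # current number of tokens drafted per step
    ema_accept: float = 0.5     # running estimate of the per-token acceptance rate
    ema_decay: float = 0.9      # smoothing factor for the running estimate
    grow_above: float = 0.8     # lengthen drafts when acceptance exceeds this
    shrink_below: float = 0.5   # shorten drafts when acceptance falls below this

    def use_speculation(self, active_batch_size: int) -> bool:
        """Toggle: small active batches are treated as the memory-bound regime."""
        return active_batch_size <= self.max_spec_batch

    def update(self, accepted: int, drafted: int) -> int:
        """Fold one verification result into the estimate and adapt the draft length."""
        if drafted <= 0:
            return self.draft_len
        rate = accepted / drafted
        self.ema_accept = (
            self.ema_decay * self.ema_accept + (1.0 - self.ema_decay) * rate
        )
        if self.ema_accept > self.grow_above:
            self.draft_len = min(self.max_len, self.draft_len + 1)
        elif self.ema_accept < self.shrink_below:
            self.draft_len = max(self.min_len, self.draft_len - 1)
        return self.draft_len
```

In a rollout loop, one would call `use_speculation` with the number of still-active sequences before each decoding step and fall back to plain AR decoding when it returns `False`. After each verification pass, `update` receives the accepted and drafted token counts. A real system would likely derive the toggle from measured hardware utilization instead of a fixed batch-size threshold.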