Accelerating RL Post-Training Rollouts via System-Integrated Speculative Decoding
April 29, 2026
Authors: Hayate Iso, Tiyasa Mitra, Sudipta Mondal, Rasoul Shafipour, Venmugil Elango, Terry Kong, Yuki Huang, Seonjin Na, Izzy Putterman, Benjamin Chislett, Maor Ashkenazi, Joseph Guman, Gerald Shen, Tugrul Konuk, Ashwath Aithal, Ritika Borkar, Ran Zilberstein, Bita Rouhani
cs.AI
Abstract
RL post-training of frontier language models is increasingly bottlenecked by autoregressive rollout generation, making rollout acceleration a central systems challenge. Many existing efficiency methods improve throughput by changing the rollout or optimization regime, for example through off-policy execution, experience replay, or lower-precision generation. We study speculative decoding as a lossless acceleration primitive for RL rollouts: it preserves the target model's output distribution. We implement speculative decoding in NeMo-RL with a vLLM backend, supporting both synchronous and asynchronous pipelines and enabling speculation during RL rollouts. The benefit applies across speculation mechanisms, including pretrained MTP heads, small external draft models, and techniques such as Eagle3, which are traditionally applied only after the RL phase; this opens a deployment path for state-of-the-art speculative decoding inside RL training. In a reasoning post-training workload at 8B scale under synchronous RL, speculative decoding improves rollout throughput by 1.8x. Using a high-fidelity performance simulator, we project that combining speculative decoding with asynchronous RL yields up to a 2.5x end-to-end training speedup at 235B scale.
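Speculative decoding is lossless because rejected draft tokens are resampled from a residual distribution, so the combined accepted-plus-corrected output is distributed exactly as the target model would sample it. The following is a minimal sketch of the standard accept/reject verification step, not the paper's NeMo-RL/vLLM implementation; the function name and array layout are illustrative:

```python
import numpy as np

def speculative_accept(p_target, q_draft, drafted_ids, rng):
    """One verification pass of standard speculative sampling.

    p_target: (k, V) target-model probabilities at each drafted position
    q_draft:  (k, V) draft-model probabilities at the same positions
    drafted_ids: (k,) token ids proposed by the draft model
    Returns accepted token ids, plus one corrective token on rejection,
    so the output is distributed exactly as the target model.
    """
    out = []
    for i, tok in enumerate(drafted_ids):
        # Accept the drafted token with probability min(1, p(tok)/q(tok)).
        if rng.random() < min(1.0, p_target[i, tok] / q_draft[i, tok]):
            out.append(int(tok))
        else:
            # On rejection, resample from the residual distribution
            # max(0, p - q), renormalized; this preserves p exactly.
            residual = np.maximum(p_target[i] - q_draft[i], 0.0)
            residual /= residual.sum()
            out.append(int(rng.choice(len(residual), p=residual)))
            return out
    # All drafts accepted; the standard algorithm would also emit one
    # bonus token from the target's next-position distribution (omitted).
    return out
```

When draft and target distributions coincide, every drafted token is accepted, which is why a well-matched draft model (e.g. an MTP head trained alongside the target) yields high speedups.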