Accelerating RL Post-Training Rollouts via System-Integrated Speculative Decoding
April 29, 2026
Authors: Hayate Iso, Tiyasa Mitra, Sudipta Mondal, Rasoul Shafipour, Venmugil Elango, Terry Kong, Yuki Huang, Seonjin Na, Izzy Putterman, Benjamin Chislett, Maor Ashkenazi, Joseph Guman, Gerald Shen, Tugrul Konuk, Ashwath Aithal, Ritika Borkar, Ran Zilberstein, Bita Rouhani
cs.AI
Abstract
RL post-training of frontier language models is increasingly bottlenecked by autoregressive rollout generation, making rollout acceleration a central systems challenge. Many existing efficiency methods improve throughput by changing the rollout or optimization regime, for example through off-policy execution, replay, or lower-precision generation. We study speculative decoding as a lossless acceleration primitive for RL rollouts, one that preserves the target model's output distribution. We implement speculative decoding in NeMo-RL with a vLLM backend, supporting both synchronous and asynchronous pipelines and enabling speculation during RL rollouts. This benefit is realizable across speculation mechanisms, such as pretrained MTP heads, small external draft models, or even techniques such as Eagle3 that are traditionally applied after the RL phase. This yields a deployment path for state-of-the-art speculative decoding inside RL training. In a reasoning post-training workload at 8B scale under synchronous RL, speculative decoding improves rollout throughput by 1.8x. Using a high-fidelity performance simulator, we project that combining speculative decoding with asynchronous RL yields up to 2.5x end-to-end training speedup at 235B scale.
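The "lossless" property claimed above comes from the standard speculative-decoding accept/reject rule (Leviathan et al., 2023): a cheap draft proposes tokens, and the target model accepts or resamples them so that the emitted token is distributed exactly according to the target. A minimal sketch of that rule, with an illustrative toy vocabulary (the function and variable names here are assumptions, not the paper's NeMo-RL/vLLM implementation):

```python
import random

def speculative_step(p_target, q_draft, drafted_token):
    """Accept the drafted token with probability min(1, p/q); on rejection,
    resample from the renormalized residual max(0, p - q). The emitted
    token is then distributed exactly as p_target (lossless)."""
    p, q = p_target[drafted_token], q_draft[drafted_token]
    if random.random() < min(1.0, p / q):
        # Accepted: the token costs one parallel verification, not one
        # full autoregressive decode step of the target model.
        return drafted_token
    # Rejected: sample from the residual distribution so the overall
    # law of the emitted token is still p_target.
    residual = [max(0.0, pt - qt) for pt, qt in zip(p_target, q_draft)]
    z = sum(residual)
    return random.choices(range(len(p_target)), weights=[r / z for r in residual])[0]

if __name__ == "__main__":
    # Empirical check on a 3-token vocabulary: draft from q, verify
    # against p, and confirm the emitted frequencies match p.
    random.seed(0)
    p = [0.1, 0.6, 0.3]   # target distribution
    q = [0.3, 0.4, 0.3]   # draft distribution
    n = 20000
    counts = [0, 0, 0]
    for _ in range(n):
        d = random.choices(range(3), weights=q)[0]
        counts[speculative_step(p, q, d)] += 1
    print([round(c / n, 2) for c in counts])
```

In the RL-rollout setting, this acceptance test runs inside the generation engine (vLLM here), so rollouts speed up while the policy's sampling distribution, and hence the on-policy training signal, is unchanged.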