FP8-RL: A Practical and Stable Low-Precision Stack for LLM Reinforcement Learning

January 26, 2026
Authors: Zhaopeng Qiu, Shuang Yu, Jingqi Zhang, Shuai Zhang, Xue Huang, Jingyi Yang, Junjie Lai
cs.AI

Abstract

Reinforcement learning (RL) for large language models (LLMs) is increasingly bottlenecked by rollout (generation), where long output sequence lengths make attention and KV-cache memory dominate end-to-end step time. FP8 offers an attractive lever for accelerating RL by reducing compute cost and memory traffic during rollout, but applying FP8 in RL introduces unique engineering and algorithmic challenges: policy weights change every step (requiring repeated quantization and weight synchronization into the inference engine) and low-precision rollouts can deviate from the higher-precision policy assumed by the trainer, causing train-inference mismatch and potential instability. This report presents a practical FP8 rollout stack for LLM RL, implemented in the veRL ecosystem with support for common training backends (e.g., FSDP/Megatron-LM) and inference engines (e.g., vLLM/SGLang). We (i) enable FP8 W8A8 linear-layer rollout using blockwise FP8 quantization, (ii) extend FP8 to KV-cache to remove long-context memory bottlenecks via per-step QKV scale recalibration, and (iii) mitigate mismatch using importance-sampling-based rollout correction (token-level TIS/MIS variants). Across dense and MoE models, these techniques deliver up to 44% rollout throughput gains while preserving learning behavior comparable to BF16 baselines.
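
As an illustration of the blockwise scheme in (i), the sketch below quantizes a linear-layer weight to FP8 with one scale per tile. It is a minimal reference implementation assuming 128x128 tiles and the float8_e4m3fn format (maximum magnitude 448); the report's actual block size, scale layout, and fused GEMM kernels may differ.

```python
# Minimal sketch of blockwise FP8 (e4m3) weight quantization for rollout.
# Assumptions (not confirmed by the report): 128x128 blocks, per-block scales,
# torch.float8_e4m3fn storage.
import torch

FP8_E4M3_MAX = 448.0  # largest representable magnitude in float8_e4m3fn

def quantize_blockwise_fp8(w: torch.Tensor, block: int = 128):
    """Quantize a 2-D weight to FP8 with one scale per (block x block) tile."""
    out_f, in_f = w.shape
    # Pad so both dimensions are multiples of the block size.
    pad_r, pad_c = (-out_f) % block, (-in_f) % block
    w_pad = torch.nn.functional.pad(w, (0, pad_c, 0, pad_r))
    R, C = w_pad.shape
    tiles = w_pad.reshape(R // block, block, C // block, block)
    # Per-tile absolute maximum -> per-tile scale mapping amax to the FP8 max.
    amax = tiles.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12)
    scale = amax / FP8_E4M3_MAX
    q = (tiles / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
    return q.reshape(R, C)[:out_f, :in_f], scale.squeeze(1).squeeze(-1)

def dequantize_blockwise_fp8(q, scale, block: int = 128):
    """Reference dequantization (real kernels fuse this into the FP8 GEMM)."""
    out_f, in_f = q.shape
    pad_r, pad_c = (-out_f) % block, (-in_f) % block
    q_pad = torch.nn.functional.pad(q.to(torch.float32), (0, pad_c, 0, pad_r))
    R, C = q_pad.shape
    tiles = q_pad.reshape(R // block, block, C // block, block)
    w = tiles * scale.unsqueeze(1).unsqueeze(-1)
    return w.reshape(R, C)[:out_f, :in_f]
```

For the correction in (iii), a token-level truncated importance sampling (TIS) weight can be sketched as a clipped ratio between the trainer's log-probabilities and those of the FP8 rollout engine. The cap value and function signature below are placeholders, not the report's settings or the MIS variant it describes.

```python
# Sketch of token-level truncated importance sampling (TIS) rollout correction.
# `logp_train` are per-token log-probs from the BF16 trainer policy,
# `logp_rollout` from the FP8 inference engine; `cap` is a hypothetical
# truncation threshold chosen here for illustration.
import torch

def tis_weighted_loss(pg_loss: torch.Tensor,
                      logp_train: torch.Tensor,
                      logp_rollout: torch.Tensor,
                      cap: float = 2.0) -> torch.Tensor:
    # Per-token importance ratio pi_train / pi_rollout, kept out of the graph.
    ratio = torch.exp(logp_train - logp_rollout).detach()
    # Truncate large ratios to bound the variance of the correction.
    return (pg_loss * ratio.clamp(max=cap)).mean()
```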