FP8-RL: A Practical and Stable Low-Precision Stack for LLM Reinforcement Learning
January 26, 2026
Authors: Zhaopeng Qiu, Shuang Yu, Jingqi Zhang, Shuai Zhang, Xue Huang, Jingyi Yang, Junjie Lai
cs.AI
Abstract
Reinforcement learning (RL) for large language models (LLMs) is increasingly bottlenecked by rollout (generation), where long output sequence lengths make attention and KV-cache memory dominate end-to-end step time. FP8 offers an attractive lever for accelerating RL by reducing compute cost and memory traffic during rollout, but applying FP8 in RL introduces unique engineering and algorithmic challenges: policy weights change every step (requiring repeated quantization and weight synchronization into the inference engine), and low-precision rollouts can deviate from the higher-precision policy assumed by the trainer, causing train-inference mismatch and potential instability. This report presents a practical FP8 rollout stack for LLM RL, implemented in the veRL ecosystem with support for common training backends (e.g., FSDP/Megatron-LM) and inference engines (e.g., vLLM/SGLang). We (i) enable FP8 W8A8 linear-layer rollout using blockwise FP8 quantization, (ii) extend FP8 to the KV cache to remove the long-context memory bottleneck via per-step QKV scale recalibration, and (iii) mitigate mismatch using importance-sampling-based rollout correction (token-level TIS/MIS variants). Across dense and MoE models, these techniques deliver up to 44% rollout throughput gains while preserving learning behavior comparable to BF16 baselines.
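For point (i), a minimal sketch of blockwise FP8 weight quantization is shown below, assuming PyTorch's `torch.float8_e4m3fn` dtype. The helper names and the 128x128 block size are illustrative assumptions, not the paper's implementation; in a real rollout stack the per-block scales would be handed to the inference engine's FP8 GEMM kernels rather than dequantized on the fly.

```python
# Illustrative sketch of blockwise FP8 quantization: each (block x block) tile of a
# linear-layer weight gets its own scale so the tile's max magnitude maps onto the
# FP8 E4M3 range. Hypothetical helpers, not the paper's code.
import torch

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in torch.float8_e4m3fn


def quantize_blockwise_fp8(weight: torch.Tensor, block: int = 128):
    """Quantize a 2-D BF16/FP32 weight to FP8 with one scale per (block x block) tile."""
    rows, cols = weight.shape
    pad_r, pad_c = (-rows) % block, (-cols) % block
    w = torch.nn.functional.pad(weight.float(), (0, pad_c, 0, pad_r))
    # Reshape into (row_blocks, col_blocks, block, block) tiles.
    tiles = w.view(w.shape[0] // block, block, w.shape[1] // block, block).permute(0, 2, 1, 3)
    amax = tiles.abs().amax(dim=(-2, -1), keepdim=True).clamp(min=1e-12)
    scales = amax / FP8_E4M3_MAX  # one dequantization scale per tile
    q = (tiles / scales).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
    return q, scales.squeeze(-1).squeeze(-1)


def dequantize_blockwise_fp8(q: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    """Inverse mapping, useful for checking per-block quantization error."""
    return q.float() * scales[..., None, None]
```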
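For point (iii), a token-level truncated importance sampling (TIS) correction can be sketched as a policy-gradient loss reweighted by clipped ratios between the trainer's policy and the lower-precision rollout policy. The function below is a hypothetical illustration under that reading; the argument names and the clipping threshold `clip_c` are assumptions, not the authors' veRL implementation.

```python
# Sketch of a token-level truncated importance-sampling (TIS) correction for
# train-inference mismatch between a BF16 trainer policy and an FP8 rollout engine.
import torch


def tis_corrected_pg_loss(trainer_logprobs: torch.Tensor,   # log pi_train(a_t | s_t), from the BF16 trainer
                          rollout_logprobs: torch.Tensor,   # log pi_rollout(a_t | s_t), from the FP8 engine
                          advantages: torch.Tensor,         # per-token advantage estimates
                          mask: torch.Tensor,               # 1 for response tokens, 0 for padding
                          clip_c: float = 2.0) -> torch.Tensor:
    """Token-level policy gradient, reweighted by truncated ratios pi_train / pi_rollout."""
    # Importance ratio between the trainer's policy and the rollout policy.
    ratio = torch.exp(trainer_logprobs - rollout_logprobs.detach())
    # Truncate large ratios so badly mismatched tokens cannot dominate the gradient;
    # the weight is detached so it acts as a constant coefficient on the surrogate.
    tis_weight = torch.clamp(ratio, max=clip_c).detach()
    pg = -tis_weight * advantages * trainer_logprobs
    return (pg * mask).sum() / mask.sum().clamp(min=1.0)
```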