Jet-RL: Enabling On-Policy FP8 Reinforcement Learning with Unified Training and Rollout Precision Flow
January 20, 2026
Authors: Haocheng Xi, Charlie Ruan, Peiyuan Liao, Yujun Lin, Han Cai, Yilong Zhao, Shuo Yang, Kurt Keutzer, Song Han, Ligeng Zhu
cs.AI
Abstract
Reinforcement learning (RL) is essential for enhancing the complex reasoning capabilities of large language models (LLMs). However, existing RL training pipelines are computationally inefficient and resource-intensive, with the rollout phase accounting for over 70% of total training time. Quantized RL training, particularly using FP8 precision, offers a promising approach to mitigating this bottleneck. A commonly adopted strategy applies FP8 precision during rollout while retaining BF16 precision for training. In this work, we present the first comprehensive study of FP8 RL training and demonstrate that the widely used BF16-training + FP8-rollout strategy suffers from severe training instability and catastrophic accuracy collapse under long-horizon rollouts and challenging tasks. Our analysis shows that these failures stem from the off-policy nature of the approach, which introduces substantial numerical mismatch between training and inference. Motivated by these observations, we propose Jet-RL, an FP8 RL training framework that enables robust and stable RL optimization. The key idea is to adopt a unified FP8 precision flow for both training and rollout, thereby minimizing numerical discrepancies and eliminating the need for inefficient inter-step calibration. Extensive experiments validate the effectiveness of Jet-RL: our method achieves up to 33% speedup in the rollout phase, up to 41% speedup in the training phase, and a 16% end-to-end speedup over BF16 training, while maintaining stable convergence across all settings and incurring negligible accuracy degradation.
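To make the numerical-mismatch argument concrete, below is a minimal NumPy sketch. It is not the Jet-RL implementation: the `fake_quant_e4m3` helper, the tensor shapes, and the random logits are all illustrative assumptions. It shows why a BF16-training + FP8-rollout pipeline is effectively off-policy (the distribution the FP8 sampler draws from differs from the distribution the higher-precision trainer differentiates), and why a unified precision flow removes that gap by construction.

```python
# Illustrative sketch only -- not the Jet-RL implementation. A toy
# fake-quantizer stands in for real FP8 (E4M3) kernels to show the
# training/rollout distribution mismatch described in the abstract.
import numpy as np

def fake_quant_e4m3(x: np.ndarray) -> np.ndarray:
    """Round to an FP8-E4M3-like grid: ~4 significand bits, clamped to +-448."""
    x = np.clip(x, -448.0, 448.0)
    mant, exp = np.frexp(x)              # x = mant * 2**exp, |mant| in [0.5, 1)
    mant = np.round(mant * 16.0) / 16.0  # keep roughly 4 significand bits
    return np.ldexp(mant, exp)

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=-1, keepdims=True)  # numerically stable softmax
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
logits_hi = rng.normal(scale=4.0, size=(1, 8))  # "training-precision" logits (toy)

# Mixed flow (BF16-training + FP8-rollout): the sampler sees quantized
# logits, but the trainer scores the sampled tokens with full-precision
# logits -- the two policy distributions disagree, i.e. off-policy drift.
pi_rollout = softmax(fake_quant_e4m3(logits_hi))
pi_train   = softmax(logits_hi)
print("mixed-flow mismatch:  ", np.abs(pi_rollout - pi_train).max())

# Unified flow: training and rollout consume the same quantized
# activations, so both phases see the identical policy distribution.
pi_unified = softmax(fake_quant_e4m3(logits_hi))
print("unified-flow mismatch:", np.abs(pi_unified - pi_rollout).max())  # 0.0
```

In this toy setting the mixed flow already produces a nonzero per-token probability gap, and over long-horizon rollouts such gaps compound across thousands of sampled tokens, which is consistent with the instability the abstract attributes to the BF16-training + FP8-rollout strategy.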