
Jet-RL: Enabling On-Policy FP8 Reinforcement Learning with Unified Training and Rollout Precision Flow

January 20, 2026
作者: Haocheng Xi, Charlie Ruan, Peiyuan Liao, Yujun Lin, Han Cai, Yilong Zhao, Shuo Yang, Kurt Keutzer, Song Han, Ligeng Zhu
cs.AI

Abstract

Reinforcement learning (RL) is essential for enhancing the complex reasoning capabilities of large language models (LLMs). However, existing RL training pipelines are computationally inefficient and resource-intensive, with the rollout phase accounting for over 70% of total training time. Quantized RL training, particularly using FP8 precision, offers a promising approach to mitigating this bottleneck. A commonly adopted strategy applies FP8 precision during rollout while retaining BF16 precision for training. In this work, we present the first comprehensive study of FP8 RL training and demonstrate that the widely used BF16-training + FP8-rollout strategy suffers from severe training instability and catastrophic accuracy collapse under long-horizon rollouts and challenging tasks. Our analysis shows that these failures stem from the off-policy nature of the approach, which introduces substantial numerical mismatch between training and inference. Motivated by these observations, we propose Jet-RL, an FP8 RL training framework that enables robust and stable RL optimization. The key idea is to adopt a unified FP8 precision flow for both training and rollout, thereby minimizing numerical discrepancies and eliminating the need for inefficient inter-step calibration. Extensive experiments validate the effectiveness of Jet-RL: our method achieves up to 33% speedup in the rollout phase, up to 41% speedup in the training phase, and a 16% end-to-end speedup over BF16 training, while maintaining stable convergence across all settings and incurring negligible accuracy degradation.
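The train/rollout numerical mismatch the abstract describes can be illustrated with a small simulation. The sketch below is not Jet-RL's implementation; it is a minimal NumPy approximation of per-tensor FP8 E4M3 rounding (4-bit significand: 1 implicit + 3 explicit mantissa bits, scaled so the tensor's max magnitude maps to E4M3's ~448 range). The function name and the scaling scheme are assumptions for illustration. Each quantized value carries a small relative error; when the rollout policy runs through such rounding while the trainer computes in BF16, these per-token discrepancies compound over long autoregressive generations, producing the off-policy drift the paper identifies.

```python
import numpy as np

def fp8_e4m3_quantize(x, max_val=448.0):
    """Simulate per-tensor FP8 E4M3 quantization (illustrative only):
    scale the tensor so its max magnitude hits the FP8 range, round the
    significand to 4 bits (1 implicit + 3 explicit), then rescale."""
    scale = max_val / max(np.abs(x).max(), 1e-12)
    xs = np.clip(x * scale, -max_val, max_val)
    m, e = np.frexp(xs)             # xs = m * 2**e with 0.5 <= |m| < 1
    m = np.round(m * 16.0) / 16.0   # round mantissa to 3 explicit bits
    return np.ldexp(m, e) / scale

rng = np.random.default_rng(0)
x = rng.standard_normal(1024)
q = fp8_e4m3_quantize(x)
# Per-element relative error is bounded by half an FP8 ulp (1/16 here) --
# small per step, but it accumulates across thousands of rollout tokens.
rel_err = np.max(np.abs(q - x) / np.maximum(np.abs(x), 1e-12))
```

Under a unified precision flow, both training and rollout would see the same `q`, so the policy that generated the trajectory matches the policy being optimized; under BF16-training + FP8-rollout, the gradient is computed against `x` while the data came from `q`.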
January 27, 2026