
FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling

April 8, 2026
Authors: Yitong Li, Junsong Chen, Shuchen Xue, Pengcuo Zeren, Siyuan Fu, Dinghao Yang, Yangyang Tang, Junjie Bai, Ping Luo, Song Han, Enze Xie
cs.AI

Abstract

Reinforcement-learning-based post-training has recently emerged as a promising paradigm for aligning text-to-image diffusion models with human preferences. Recent studies show that increasing the rollout group size yields pronounced performance improvements, indicating substantial room for further alignment gains. However, scaling rollouts on large foundation diffusion models (e.g., FLUX.1-12B) imposes a heavy computational burden. To alleviate this bottleneck, we explore integrating FP4 quantization into diffusion RL rollouts. Yet we identify that naive quantized pipelines inherently introduce risks of performance degradation. To overcome this dilemma between efficiency and training integrity, we propose Sol-RL (Speed-of-light RL), a novel FP4-empowered two-stage reinforcement learning framework. First, we use high-throughput NVFP4 rollouts to generate a massive candidate pool and extract a highly contrastive subset. Second, we regenerate these selected samples in BF16 precision and optimize the policy exclusively on them. By decoupling candidate exploration from policy optimization, Sol-RL integrates the algorithmic mechanisms of rollout scaling with the system-level throughput gains of NVFP4. This synergistic algorithm-hardware design effectively accelerates the rollout phase while reserving high-fidelity samples for optimization. We empirically demonstrate that our framework maintains the training integrity of the BF16 precision pipeline while fully exploiting the throughput gains enabled by FP4 arithmetic. Extensive experiments across SANA, FLUX.1, and SD3.5-L substantiate that our approach delivers superior alignment performance across multiple metrics while accelerating training convergence by up to 4.64×, unlocking the power of massive rollout scaling at a fraction of the cost.
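The two-stage procedure described above — cheap low-precision exploration to build a large candidate pool, selection of a highly contrastive subset, then high-precision regeneration and policy optimization on that subset only — can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the seeded random scores stand in for NVFP4 rollouts and their reward-model scores, the top-k/bottom-k rule stands in for the contrastive-subset extraction, and the scalar advantage step stands in for the BF16 policy update; all function names and parameters are assumptions.

```python
import random


def fp4_rollout_scores(prompt_seed, pool_size):
    """Stage 1 (sketch): high-throughput low-precision rollouts produce a
    large candidate pool. A seeded RNG stands in for NVFP4 sampling, and
    each float stands in for a candidate image's reward score."""
    rng = random.Random(prompt_seed)
    return [rng.uniform(0.0, 1.0) for _ in range(pool_size)]


def select_contrastive_subset(scores, k):
    """Keep the indices of the k lowest- and k highest-reward candidates,
    i.e. one plausible reading of the 'highly contrastive subset'."""
    order = sorted(range(len(scores)), key=scores.__getitem__)
    return order[:k] + order[-k:]


def bf16_regenerate_and_update(selected, scores, lr=0.1):
    """Stage 2 (sketch): the selected candidates are regenerated at high
    precision and the policy is optimized only on them. Here the update is
    reduced to per-sample advantages (reward minus group baseline); a real
    step would weight per-sample log-probabilities by these advantages."""
    rewards = [scores[i] for i in selected]
    baseline = sum(rewards) / len(rewards)
    return [lr * (r - baseline) for r in rewards]
```

Because the subset is symmetric around the pool's extremes, the group-relative advantages are strongly signed, which is the point of contrastive selection: the policy sees maximally informative positive and negative examples at BF16 fidelity while the bulk of the pool was generated at FP4 cost.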