

FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling

April 8, 2026
作者: Yitong Li, Junsong Chen, Shuchen Xue, Pengcuo Zeren, Siyuan Fu, Dinghao Yang, Yangyang Tang, Junjie Bai, Ping Luo, Song Han, Enze Xie
cs.AI

Abstract

Reinforcement-learning-based post-training has recently emerged as a promising paradigm for aligning text-to-image diffusion models with human preferences. In recent studies, increasing the rollout group size yields pronounced performance improvements, indicating substantial room for further alignment gains. However, scaling rollouts on large-scale foundational diffusion models (e.g., FLUX.1-12B) imposes a heavy computational burden. To alleviate this bottleneck, we explore the integration of FP4 quantization into diffusion RL rollouts. Yet we identify that naively quantized pipelines inherently introduce risks of performance degradation. To overcome this dilemma between efficiency and training integrity, we propose Sol-RL (Speed-of-light RL), a novel FP4-empowered two-stage reinforcement learning framework. First, we use high-throughput NVFP4 rollouts to generate a massive candidate pool and extract a highly contrastive subset. Second, we regenerate these selected samples in BF16 precision and optimize the policy exclusively on them. By decoupling candidate exploration from policy optimization, Sol-RL integrates the algorithmic mechanisms of rollout scaling with the system-level throughput gains of NVFP4. This synergistic algorithm-hardware design accelerates the rollout phase while reserving high-fidelity samples for optimization. We empirically demonstrate that our framework maintains the training integrity of the BF16-precision pipeline while fully exploiting the throughput gains enabled by FP4 arithmetic. Extensive experiments across SANA, FLUX.1, and SD3.5-L substantiate that our approach delivers superior alignment performance across multiple metrics while accelerating training convergence by up to 4.64×, unlocking the power of massive rollout scaling at a fraction of the cost.
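The two-stage loop described in the abstract — cheap low-precision exploration, contrastive selection, then high-precision regeneration for the policy update — can be sketched as follows. This is a minimal illustration under assumptions, not the paper's implementation: the callables `rollout_fp4`, `rollout_bf16`, and `reward` are hypothetical stand-ins for NVFP4 sampling, BF16 re-sampling, and the reward model, and the selection rule (keeping the extremes of the reward ranking) is one plausible reading of "highly contrastive subset".

```python
def two_stage_rollout_step(prompt, pool_size, keep,
                           rollout_fp4, rollout_bf16, reward):
    """One Sol-RL-style iteration (hypothetical interface).

    1) Generate a large candidate pool with cheap low-precision rollouts
       (stands in for high-throughput NVFP4 sampling).
    2) Rank candidates by reward and keep a contrastive subset:
       the lowest- and highest-reward samples.
    3) Regenerate only the kept candidates at high precision
       (stands in for BF16 re-rollout) and return them for the
       policy-optimization step.
    """
    # Stage 1: high-throughput exploration over a massive candidate pool.
    pool = [rollout_fp4(prompt) for _ in range(pool_size)]
    ranked = sorted(pool, key=reward)

    # Keep the extremes of the reward ranking: a highly contrastive subset.
    n_low = keep // 2
    n_high = keep - n_low
    subset = ranked[:n_low] + ranked[-n_high:]

    # Stage 2: regenerate the selected samples in high precision; only
    # these high-fidelity samples reach the optimizer.
    return [rollout_bf16(s) for s in subset]
```

A toy usage, with deterministic stubs in place of the samplers: `rollout_fp4` draws scalar "samples", `reward` is the identity, and `rollout_bf16` just passes the sample through.

```python
vals = iter([0.1, 0.9, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4])
selected = two_stage_rollout_step(
    "a cat", pool_size=8, keep=4,
    rollout_fp4=lambda p: next(vals),
    rollout_bf16=lambda s: s,
    reward=lambda s: s,
)
# selected holds the two worst and two best candidates: [0.1, 0.2, 0.8, 0.9]
```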