FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling

Abstract

Reinforcement-learning-based post-training has recently emerged as a promising paradigm for aligning text-to-image diffusion models with human preferences. Recent studies show that increasing the rollout group size yields pronounced performance improvements, indicating substantial room for further alignment gains. However, scaling rollouts on large-scale foundational diffusion models (e.g., FLUX.1-12B) imposes a heavy computational burden. To alleviate this bottleneck, we explore the integration of FP4 quantization into Diffusion RL rollouts. Yet we find that naive quantized pipelines inherently risk performance degradation. To overcome this dilemma between efficiency and training integrity, we propose Sol-RL (Speed-of-light RL), a novel FP4-empowered two-stage reinforcement learning framework. First, we use high-throughput NVFP4 rollouts to generate a massive candidate pool and extract a highly contrastive subset. Second, we regenerate these selected samples in BF16 precision and optimize the policy exclusively on them. By decoupling candidate exploration from policy optimization, Sol-RL integrates the algorithmic mechanisms of rollout scaling with the system-level throughput gains of NVFP4. This synergistic algorithm-hardware design accelerates the rollout phase while reserving high-fidelity samples for optimization. We empirically demonstrate that our framework maintains the training integrity of a BF16-precision pipeline while fully exploiting the throughput gains enabled by FP4 arithmetic. Extensive experiments across SANA, FLUX.1, and SD3.5-L substantiate that our approach delivers superior alignment performance across multiple metrics while accelerating training convergence by up to 4.64×, unlocking the power of massive rollout scaling at a fraction of the cost.
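
The two-stage idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not say how the "highly contrastive subset" is chosen, so the rule used here (keep the k lowest- and k highest-reward candidates) is an assumption, as are the mean-centered advantages and every name in the code (`select_contrastive`, `two_stage_step`, `cheap_rollout`, `precise_rollout`). The toy rollouts return a single float; a coarse rounding step stands in for FP4 quantization noise.

```python
import random
from typing import Callable, List, Tuple


def select_contrastive(rewards: List[float], k: int) -> List[int]:
    """Indices of the k lowest- and k highest-reward candidates (assumed rule)."""
    if k <= 0 or 2 * k > len(rewards):
        raise ValueError("need 0 < 2*k <= number of candidates")
    order = sorted(range(len(rewards)), key=lambda i: rewards[i])
    return order[:k] + order[-k:]


def two_stage_step(
    seeds: List[int],
    cheap_rollout: Callable[[int], float],
    precise_rollout: Callable[[int], float],
    reward_fn: Callable[[float], float],
    k: int,
) -> Tuple[List[int], List[float]]:
    """Explore with cheap rollouts, then regenerate only the selected seeds."""
    # Stage 1: score the whole candidate pool using the low-precision rollout.
    cheap_rewards = [reward_fn(cheap_rollout(s)) for s in seeds]
    chosen = [seeds[i] for i in select_contrastive(cheap_rewards, k)]
    # Stage 2: regenerate the chosen seeds at high precision and re-score them.
    rewards = [reward_fn(precise_rollout(s)) for s in chosen]
    # Mean-centered advantages over the selected group (an assumption here).
    mean = sum(rewards) / len(rewards)
    return chosen, [r - mean for r in rewards]


def precise_rollout(seed: int) -> float:
    """Toy stand-in for a BF16 sample: a seeded Gaussian draw."""
    return random.Random(seed).gauss(0.0, 1.0)


def cheap_rollout(seed: int) -> float:
    """Toy stand-in for an FP4 sample: the same draw, coarsely rounded."""
    return round(precise_rollout(seed) * 4) / 4


if __name__ == "__main__":
    picked, advantages = two_stage_step(
        list(range(16)), cheap_rollout, precise_rollout, lambda x: x, k=2
    )
    print(picked, advantages)
```

In this sketch only the selected seeds are regenerated at high precision, so the quantity a policy update would consume never comes from the quantized rollout. That mirrors the decoupling of exploration from optimization that the abstract describes.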
