
DiffusionNFT: Online Diffusion Reinforcement with Forward Process

September 19, 2025
作者: Kaiwen Zheng, Huayu Chen, Haotian Ye, Haoxiang Wang, Qinsheng Zhang, Kai Jiang, Hang Su, Stefano Ermon, Jun Zhu, Ming-Yu Liu
cs.AI

Abstract

Online reinforcement learning (RL) has been central to post-training language models, but its extension to diffusion models remains challenging due to intractable likelihoods. Recent works discretize the reverse sampling process to enable GRPO-style training, yet they inherit fundamental drawbacks, including solver restrictions, forward-reverse inconsistency, and complicated integration with classifier-free guidance (CFG). We introduce Diffusion Negative-aware FineTuning (DiffusionNFT), a new online RL paradigm that optimizes diffusion models directly on the forward process via flow matching. DiffusionNFT contrasts positive and negative generations to define an implicit policy improvement direction, naturally incorporating reinforcement signals into the supervised learning objective. This formulation enables training with arbitrary black-box solvers, eliminates the need for likelihood estimation, and requires only clean images rather than sampling trajectories for policy optimization. DiffusionNFT is up to 25× more efficient than FlowGRPO in head-to-head comparisons, while being CFG-free. For instance, DiffusionNFT improves the GenEval score from 0.24 to 0.98 within 1k steps, while FlowGRPO achieves 0.95 with over 5k steps and additional CFG employment. By leveraging multiple reward models, DiffusionNFT significantly boosts the performance of SD3.5-Medium in every benchmark tested.
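The abstract states that DiffusionNFT trains on the forward process via flow matching, using only clean images split into reward-positive and reward-negative generations, with no likelihood estimation or reverse-process trajectories. As a rough illustration of that idea (not the paper's actual objective), the sketch below applies a standard flow-matching loss to positive samples and down-weights negative ones; the `model(xt, t, cond)` signature, the `beta` weight, and all helper names are assumptions made for this example.

```python
import torch

def flow_matching_loss(model, x1, cond, t):
    """Standard conditional flow-matching loss on the forward process:
    noise the clean image x1 along a straight path and regress the velocity."""
    x0 = torch.randn_like(x1)            # forward-process noise
    xt = (1 - t) * x0 + t * x1           # linear interpolation at time t
    target_v = x1 - x0                   # straight-path velocity target
    pred_v = model(xt, t, cond)          # assumed signature: model(x_t, t, condition)
    return ((pred_v - target_v) ** 2).mean(dim=(1, 2, 3))

def negative_aware_step(model, opt, pos_imgs, neg_imgs, cond, beta=0.1):
    """Illustrative negative-aware update: fit reward-positive generations
    while pushing away from reward-negative ones. Uses only clean images
    and forward-process noising; no sampling trajectories or likelihoods.
    beta is an arbitrary illustrative weight, not a value from the paper."""
    t = torch.rand(pos_imgs.shape[0], 1, 1, 1, device=pos_imgs.device)
    loss_pos = flow_matching_loss(model, pos_imgs, cond, t).mean()
    loss_neg = flow_matching_loss(model, neg_imgs, cond, t).mean()
    loss = loss_pos - beta * loss_neg
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

The paper's method defines the policy improvement direction implicitly from the positive/negative contrast rather than through a fixed negative weight as above; the sketch only shows why any black-box sampler can be used for generation, since optimization touches nothing but clean images and forward-process noise.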