DiffusionNFT: Online Diffusion Reinforcement with Forward Process

September 19, 2025
作者: Kaiwen Zheng, Huayu Chen, Haotian Ye, Haoxiang Wang, Qinsheng Zhang, Kai Jiang, Hang Su, Stefano Ermon, Jun Zhu, Ming-Yu Liu
cs.AI

Abstract

Online reinforcement learning (RL) has been central to post-training language models, but its extension to diffusion models remains challenging due to intractable likelihoods. Recent works discretize the reverse sampling process to enable GRPO-style training, yet they inherit fundamental drawbacks, including solver restrictions, forward-reverse inconsistency, and complicated integration with classifier-free guidance (CFG). We introduce Diffusion Negative-aware FineTuning (DiffusionNFT), a new online RL paradigm that optimizes diffusion models directly on the forward process via flow matching. DiffusionNFT contrasts positive and negative generations to define an implicit policy improvement direction, naturally incorporating reinforcement signals into the supervised learning objective. This formulation enables training with arbitrary black-box solvers, eliminates the need for likelihood estimation, and requires only clean images rather than sampling trajectories for policy optimization. DiffusionNFT is up to 25× more efficient than FlowGRPO in head-to-head comparisons, while being CFG-free. For instance, DiffusionNFT improves the GenEval score from 0.24 to 0.98 within 1k steps, while FlowGRPO achieves 0.95 with over 5k steps and additional CFG employment. By leveraging multiple reward models, DiffusionNFT significantly boosts the performance of SD3.5-Medium in every benchmark tested.
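The abstract describes folding the reinforcement signal into a forward-process flow-matching objective by contrasting positive and negative generations. The PyTorch sketch below illustrates one plausible shape of such an update: group-relative advantages weight a standard flow-matching loss computed from clean images only. The `velocity_model(x_t, t, cond)` signature, the uniform sampling of `t`, and the simple advantage weighting are illustrative assumptions, not the paper's actual DiffusionNFT objective.

```python
import torch

def advantage_weighted_fm_loss(velocity_model, x0, rewards, cond):
    """Hypothetical sketch of a negative-aware, forward-process RL update.

    x0:      clean images generated by the current policy, shape (B, C, H, W)
    rewards: scalar rewards for each image, shape (B,)
    cond:    conditioning (e.g. text embeddings) passed through to the model
    """
    # Group-relative advantages: above-average samples get positive weight,
    # below-average ("negative") samples get negative weight.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

    # Standard rectified-flow / flow-matching construction on the forward process:
    # x_t = (1 - t) * x0 + t * eps, with target velocity eps - x0.
    eps = torch.randn_like(x0)
    t = torch.rand(x0.shape[0], device=x0.device).view(-1, 1, 1, 1)
    x_t = (1.0 - t) * x0 + t * eps
    v_target = eps - x0

    # Per-sample supervised flow-matching error.
    v_pred = velocity_model(x_t, t.flatten(), cond)
    per_sample = ((v_pred - v_target) ** 2).mean(dim=(1, 2, 3))

    # The reinforcement signal enters only as a per-sample weight on the
    # supervised objective; no likelihoods or sampling trajectories are used.
    return (adv * per_sample).mean()
```

As in the abstract, only clean generations and their rewards enter the update; no reverse-sampling trajectories, likelihood estimates, or CFG machinery are required, so samples can come from any black-box solver.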