DiffusionNFT: Online Diffusion Reinforcement with Forward Process
September 19, 2025
Authors: Kaiwen Zheng, Huayu Chen, Haotian Ye, Haoxiang Wang, Qinsheng Zhang, Kai Jiang, Hang Su, Stefano Ermon, Jun Zhu, Ming-Yu Liu
cs.AI
Abstract
Online reinforcement learning (RL) has been central to post-training language
models, but its extension to diffusion models remains challenging due to
intractable likelihoods. Recent works discretize the reverse sampling process
to enable GRPO-style training, yet they inherit fundamental drawbacks,
including solver restrictions, forward-reverse inconsistency, and complicated
integration with classifier-free guidance (CFG). We introduce Diffusion
Negative-aware FineTuning (DiffusionNFT), a new online RL paradigm that
optimizes diffusion models directly on the forward process via flow matching.
DiffusionNFT contrasts positive and negative generations to define an implicit
policy improvement direction, naturally incorporating reinforcement signals
into the supervised learning objective. This formulation enables training with
arbitrary black-box solvers, eliminates the need for likelihood estimation, and
requires only clean images rather than sampling trajectories for policy
optimization. DiffusionNFT is up to 25× more efficient than FlowGRPO in
head-to-head comparisons, while being CFG-free. For instance, DiffusionNFT
improves the GenEval score from 0.24 to 0.98 within 1k steps, whereas FlowGRPO
reaches 0.95 only after more than 5k steps and with additional CFG. By
leveraging multiple reward models, DiffusionNFT significantly boosts the
performance of SD3.5-Medium on every benchmark tested.
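
To make the abstract's idea concrete, below is a minimal, hypothetical PyTorch-style sketch of how a reinforcement signal from positive and negative generations could be folded into a supervised flow-matching objective on the forward process. The `sampler`, `reward_fn`, mixture coefficient `beta`, and the exact form of the negative target are illustrative assumptions, not the paper's verified objective.

```python
# Hypothetical sketch of a negative-aware flow-matching update (not the paper's exact method).
import torch
import torch.nn.functional as F

def nft_flow_matching_step(policy, ref_policy, sampler, reward_fn, prompts, optimizer,
                           group_size=8, beta=0.5):
    with torch.no_grad():
        # 1) Collect clean images from any black-box solver; no trajectories or likelihoods needed.
        images, conds = sampler(ref_policy, prompts, n_per_prompt=group_size)   # (B, C, H, W)
        rewards = reward_fn(images, conds)                                      # (B,)
        # 2) Group-relative rewards label each generation as positive or negative.
        baseline = rewards.view(-1, group_size).mean(dim=1).repeat_interleave(group_size)
        is_pos = (rewards >= baseline).float().view(-1, 1, 1, 1)

    # 3) Standard forward-process flow matching on the clean images (rectified-flow style).
    x1, x0 = images, torch.randn_like(images)
    t = torch.rand(x1.shape[0], device=x1.device).view(-1, 1, 1, 1)
    xt = (1.0 - t) * x0 + t * x1                         # noisy sample on the forward path
    v_target = x1 - x0                                   # data velocity for this noise pairing
    v_pred = policy(xt, t.squeeze(), conds)
    with torch.no_grad():
        v_ref = ref_policy(xt, t.squeeze(), conds)       # frozen pre-update model

    # 4) Negative-aware targets (assumed form): positives regress toward the data velocity;
    #    negatives regress toward a target chosen so that an implicit mixture with the
    #    reference model moves away from them.
    v_neg_target = (v_ref - (1.0 - beta) * v_target) / beta
    target = is_pos * v_target + (1.0 - is_pos) * v_neg_target
    loss = F.mse_loss(v_pred, target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note that the update touches only clean images and a noised interpolation of them, which is what allows sampling with arbitrary solvers and removes any need for likelihood estimation; the specific positive/negative weighting above is only one plausible realization of the "implicit policy improvement direction" described in the abstract.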