DiffusionNFT: Online Diffusion Reinforcement with Forward Process

Abstract

Online reinforcement learning (RL) has been central to post-training language models, but extending it to diffusion models remains challenging because their likelihoods are intractable. Recent works discretize the reverse sampling process to enable GRPO-style training, yet they inherit fundamental drawbacks, including solver restrictions, forward-reverse inconsistency, and complicated integration with classifier-free guidance (CFG). We introduce Diffusion Negative-aware FineTuning (DiffusionNFT), a new online RL paradigm that optimizes diffusion models directly on the forward process via flow matching. DiffusionNFT contrasts positive and negative generations to define an implicit policy improvement direction, naturally incorporating reinforcement signals into the supervised learning objective. This formulation enables training with arbitrary black-box solvers, eliminates the need for likelihood estimation, and requires only clean images rather than sampling trajectories for policy optimization. DiffusionNFT is up to 25× more efficient than FlowGRPO in head-to-head comparisons, while being CFG-free. For instance, DiffusionNFT improves the GenEval score from 0.24 to 0.98 within 1k steps, whereas FlowGRPO reaches 0.95 only after more than 5k steps and with additional use of CFG. By leveraging multiple reward models, DiffusionNFT significantly boosts the performance of SD3.5-Medium in every benchmark tested.
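The abstract does not state the training objective, so the following is only an illustrative sketch of what "contrasting positive and negative generations on the forward process" could look like under a flow-matching setup. Everything here is an assumption made for illustration: the rectified-flow interpolation x_t = (1 - t) x_0 + t ε, the function names, the mixing coefficient `beta`, the per-sample reward in [0, 1], and the specific form of the implicit positive and negative velocities. It should not be read as the authors' exact loss.

```python
import numpy as np


def forward_interpolate(x0, noise, t):
    """Assumed rectified-flow forward process: x_t = (1 - t) * x0 + t * noise."""
    # Reshape t from (batch,) to (batch, 1, ..., 1) so it broadcasts over x0.
    t = t.reshape(-1, *([1] * (x0.ndim - 1)))
    return (1.0 - t) * x0 + t * noise


def negative_aware_loss(v_theta, v_old, v_target, reward, beta=1.0):
    """Hypothetical reward-weighted contrast on flow-matching velocity targets.

    v_theta  : velocity predicted by the model being trained
    v_old    : velocity predicted by the frozen sampling policy
    v_target : flow-matching target, here noise - x0 (assumed convention)
    reward   : per-sample reward in [0, 1], shape (batch,)
    beta     : hypothetical mixing coefficient
    """
    # Implicit "positive" and "negative" velocities, mirrored around v_old.
    v_pos = (1.0 - beta) * v_old + beta * v_theta
    v_neg = (1.0 + beta) * v_old - beta * v_theta
    axes = tuple(range(1, v_target.ndim))
    err_pos = np.mean((v_pos - v_target) ** 2, axis=axes)
    err_neg = np.mean((v_neg - v_target) ** 2, axis=axes)
    # High-reward samples pull v_pos toward the target, low-reward samples pull v_neg.
    return float(np.mean(reward * err_pos + (1.0 - reward) * err_neg))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x0 = rng.normal(size=(4, 8))        # stand-in for clean images
    noise = rng.normal(size=(4, 8))
    t = rng.uniform(size=4)
    x_t = forward_interpolate(x0, noise, t)
    v_target = noise - x0
    v_old = rng.normal(size=(4, 8))     # stand-in for the old policy's prediction
    v_theta = v_old.copy()              # stand-in for the current model's prediction
    reward = np.array([1.0, 0.0, 0.7, 0.2])
    print(negative_aware_loss(v_theta, v_old, v_target, reward))
```

Note that the sketch only needs clean samples `x0`, freshly drawn noise, and a reward per sample. It never touches a sampling trajectory or a likelihood, which is the property the abstract emphasizes.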
