Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback
October 19, 2025
Authors: Zongjian Li, Zheyuan Liu, Qihui Zhang, Bin Lin, Shenghai Yuan, Zhiyuan Yan, Yang Ye, Wangbo Yu, Yuwei Niu, Li Yuan
cs.AI
Abstract
Instruction-based image editing has achieved remarkable progress; however,
models trained solely via supervised fine-tuning often overfit to annotated
patterns, hindering their ability to explore and generalize beyond training
distributions. To this end, we introduce Edit-R1, a novel post-training
framework for instruction-based image editing based on policy optimization.
Specifically, we utilize Diffusion Negative-aware Finetuning (DiffusionNFT), a
likelihood-free policy optimization method consistent with the flow matching
forward process, thereby enabling the use of higher-order samplers and more
efficient training. Another key challenge is the absence of a universal
reward model, a consequence of the diversity of editing instructions and
tasks. To bridge this gap, we employ a Multimodal Large Language Model (MLLM)
as a unified, training-free reward model, leveraging its output logits to
provide fine-grained feedback. Furthermore, we carefully design a low-variance
group filtering mechanism to reduce MLLM scoring noise and stabilize
optimization. UniWorld-V2, trained with this framework, achieves
state-of-the-art results on the ImgEdit and GEdit-Bench benchmarks,
scoring 4.49 and 7.83, respectively. Crucially, our framework is
model-agnostic, delivering substantial performance gains when applied to
diverse base models like Qwen-Image-Edit and FLUX-Kontext, demonstrating its
wide applicability. Code and models are publicly available at
https://github.com/PKU-YuanGroup/UniWorld-V2.
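
The abstract names two mechanisms without detailing them: a reward read from MLLM output logits, and a filter over low-variance groups. The sketch below is one plausible reading, not the authors' implementation. It assumes the reward is the expected value of a softmax over a few candidate answer tokens, and that a "group" is the set of rewards for edits sampled from one instruction, dropped when its standard deviation falls below a threshold. The names `logit_score`, `filter_low_variance_groups`, and `min_std` are hypothetical.

```python
import math
from statistics import pstdev


def logit_score(logits: dict[str, float], values: dict[str, float]) -> float:
    """Expected score under a softmax over candidate answer-token logits.

    Assumption: `logits` holds the MLLM's raw logits for a small set of
    answer tokens, and `values` maps each token to a numeric score.
    """
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(val - peak) for tok, val in logits.items()}
    total = sum(exps.values())
    return sum(values[tok] * e / total for tok, e in exps.items())


def filter_low_variance_groups(
    groups: list[list[float]], min_std: float = 0.05
) -> list[list[float]]:
    """Keep only groups whose reward spread is at least `min_std`.

    Assumption: when rewards within a group are nearly identical, their
    differences are dominated by scoring noise, so the group is skipped.
    """
    return [g for g in groups if pstdev(g) >= min_std]


if __name__ == "__main__":
    # Equal logits for "Yes" and "No" give a continuous score, not a hard label.
    print(logit_score({"Yes": 0.0, "No": 0.0}, {"Yes": 1.0, "No": 0.0}))
    # The first group is nearly constant and is dropped; the second is kept.
    print(filter_low_variance_groups([[0.90, 0.91, 0.90], [0.2, 0.8, 0.5]]))
```

Reading a score from logits rather than from a sampled answer gives a continuous value, which matches the "fine-grained feedback" the abstract describes. The threshold value here is arbitrary and would need tuning.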