UniWorld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback
October 19, 2025
Authors: Zongjian Li, Zheyuan Liu, Qihui Zhang, Bin Lin, Shenghai Yuan, Zhiyuan Yan, Yang Ye, Wangbo Yu, Yuwei Niu, Li Yuan
cs.AI
Abstract
Instruction-based image editing has achieved remarkable progress; however,
models solely trained via supervised fine-tuning often overfit to annotated
patterns, hindering their ability to explore and generalize beyond training
distributions. To address this, we introduce Edit-R1, a novel post-training
framework for instruction-based image editing based on policy optimization.
Specifically, we utilize Diffusion Negative-aware Finetuning (DiffusionNFT), a
likelihood-free policy optimization method consistent with the flow matching
forward process, thereby enabling the use of higher-order samplers and more
efficient training. Another key challenge here is the absence of a universal
reward model, resulting from the diverse nature of editing instructions and
tasks. To bridge this gap, we employ a Multimodal Large Language Model (MLLM)
as a unified, training-free reward model, leveraging its output logits to
provide fine-grained feedback. Furthermore, we carefully design a low-variance
group filtering mechanism to reduce MLLM scoring noise and stabilize
optimization. UniWorld-V2, trained with this framework, achieves
state-of-the-art results on the ImgEdit and GEdit-Bench benchmarks,
scoring 4.49 and 7.83, respectively. Crucially, our framework is
model-agnostic, delivering substantial performance gains when applied to
diverse base models like Qwen-Image-Edit and FLUX-Kontext, demonstrating its
wide applicability. Code and models are publicly available at
https://github.com/PKU-YuanGroup/UniWorld-V2.
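
To make the MLLM-as-reward idea concrete, below is a minimal sketch, assuming a prompt that asks the MLLM to grade the edit with a single rating token on a 5-point scale, of how next-token logits over the candidate rating tokens can be turned into a scalar, fine-grained reward. This is not the paper's implementation; the function name, token ids, and rating scale are illustrative assumptions.

```python
# Minimal sketch (assumption, not the paper's code): turn an MLLM's next-token
# logits at the rating position into a scalar reward. The MLLM is assumed to
# have been prompted to answer with a single rating token "1".."5" that grades
# how well the edited image follows the instruction.
import torch

def implicit_reward_from_logits(logits: torch.Tensor,
                                score_token_ids: list[int],
                                score_values: list[float]) -> torch.Tensor:
    """Expected rating under the MLLM's distribution over the rating tokens.

    logits:          [vocab_size] next-token logits at the answer position.
    score_token_ids: vocabulary ids of the candidate rating tokens ("1".."5").
    score_values:    numeric value assigned to each rating token.
    """
    score_logits = logits[score_token_ids]          # keep only the rating tokens
    probs = torch.softmax(score_logits, dim=-1)     # renormalize over ratings
    values = torch.tensor(score_values, dtype=probs.dtype)
    return (probs * values).sum()                   # soft, fine-grained reward

# Toy example: the MLLM puts most probability mass on the rating "4",
# so the expected rating comes out close to 4 rather than exactly 4.
vocab_size = 32000
logits = torch.randn(vocab_size)
rating_ids = [16, 17, 18, 19, 20]                   # hypothetical ids for "1".."5"
logits[rating_ids] = torch.tensor([0.1, 0.3, 1.0, 3.0, 1.5])
print(implicit_reward_from_logits(logits, rating_ids, [1.0, 2.0, 3.0, 4.0, 5.0]))
```

Because the reward is an expectation over the score distribution rather than a hard argmax, small shifts in the MLLM's confidence still move the signal, which is what makes logit-based feedback finer-grained than a single sampled rating.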
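
The low-variance group filtering is likewise described only at a high level in the abstract; the sketch below shows one plausible reading, in which groups of edits whose MLLM scores barely differ are discarded before computing group-normalized advantages, on the grounds that their score spread is likely dominated by scorer noise. The `min_std` threshold and the normalization are assumptions, not details from the paper.

```python
# Hedged sketch of low-variance group filtering (a plausible reading of the
# mechanism, with an assumed threshold): per prompt, several edits are scored
# by the MLLM; groups whose scores barely differ are dropped before computing
# group-normalized advantages, since their spread is likely scorer noise.
import torch

def filter_and_normalize_groups(rewards: torch.Tensor, min_std: float = 0.2):
    """rewards: [num_groups, group_size] MLLM scores, one row per prompt.

    Returns (advantages for the kept groups, boolean mask of kept groups).
    """
    keep = rewards.std(dim=1) >= min_std            # drop low-variance groups
    kept = rewards[keep]
    mean = kept.mean(dim=1, keepdim=True)
    std = kept.std(dim=1, keepdim=True)
    advantages = (kept - mean) / (std + 1e-6)       # group-wise normalization
    return advantages, keep

# Toy example: three prompts, four edits each; the flat middle group is dropped.
rewards = torch.tensor([[3.9, 4.2, 2.1, 4.8],
                        [3.0, 3.0, 3.0, 3.1],
                        [1.5, 4.5, 2.0, 3.5]])
advantages, keep = filter_and_normalize_groups(rewards)
print(keep)              # tensor([ True, False,  True])
print(advantages.shape)  # torch.Size([2, 4])
```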