Learning an Image Editing Model without Image Editing Pairs

October 16, 2025
Authors: Nupur Kumari, Sheng-Yu Wang, Nanxuan Zhao, Yotam Nitzan, Yuheng Li, Krishna Kumar Singh, Richard Zhang, Eli Shechtman, Jun-Yan Zhu, Xun Huang
cs.AI

Abstract

Recent image editing models have achieved impressive results while following natural language editing instructions, but they rely on supervised fine-tuning with large datasets of input-target pairs. This is a critical bottleneck, as such naturally occurring pairs are hard to curate at scale. Current workarounds use synthetic training pairs that leverage the zero-shot capabilities of existing models. However, this can propagate and magnify the artifacts of the pretrained model into the final trained model. In this work, we present a new training paradigm that eliminates the need for paired data entirely. Our approach directly optimizes a few-step diffusion model by unrolling it during training and leveraging feedback from vision-language models (VLMs). For each input and editing instruction, the VLM evaluates if an edit follows the instruction and preserves unchanged content, providing direct gradients for end-to-end optimization. To ensure visual fidelity, we incorporate a distribution matching loss (DMD), which constrains generated images to remain within the image manifold learned by pretrained models. We evaluate our method on standard benchmarks and include an extensive ablation study. Without any paired data, our method performs on par with various image editing diffusion models trained on extensive supervised paired data, under the few-step setting. Given the same VLM as the reward model, we also outperform RL-based techniques like Flow-GRPO.
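
The abstract describes the training loop only at a high level; the following is a minimal, self-contained PyTorch sketch of that idea, not the authors' implementation. `FewStepEditor`, `vlm_reward`, `dmd_loss`, and the frozen score network are hypothetical toy stand-ins: a few-step editor is unrolled inside the computation graph, scored by a differentiable VLM-style reward for instruction-following and content preservation, and regularized by a distribution-matching term toward a frozen pretrained model.

```python
# Schematic sketch only (assumed structure, not the paper's code): unroll a
# few-step editor, score it with a differentiable reward, add a
# distribution-matching regularizer, and backpropagate end to end.
import torch
import torch.nn as nn

class FewStepEditor(nn.Module):
    """Toy stand-in for a few-step diffusion editor conditioned on an instruction."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, image, instruction_emb, steps=4):
        x = image
        for _ in range(steps):  # unrolled sampling steps kept in the autograd graph
            x = x + self.net(x + instruction_emb)
        return x

def vlm_reward(edited, source, instruction_emb):
    # Placeholder for a differentiable VLM score of instruction-following and
    # content preservation; a real setup would backprop through the VLM itself.
    return -((edited - source - instruction_emb) ** 2).mean()

def dmd_loss(edited, frozen_score):
    # Placeholder distribution-matching term that pulls samples toward the
    # manifold of a frozen pretrained model (here a fixed linear "score").
    with torch.no_grad():
        target = frozen_score(edited)
    return ((edited - target) ** 2).mean()

dim = 64
editor = FewStepEditor(dim)
frozen_score = nn.Linear(dim, dim).requires_grad_(False)
opt = torch.optim.AdamW(editor.parameters(), lr=1e-4)

for step in range(100):
    source = torch.randn(8, dim)       # toy stand-in for image latents
    instruction = torch.randn(8, dim)  # toy stand-in for text embeddings
    edited = editor(source, instruction)
    loss = -vlm_reward(edited, source, instruction) + 0.5 * dmd_loss(edited, frozen_score)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The weighting between the reward and the distribution-matching term (0.5 here) is illustrative; the point of the sketch is that no input-target pair ever appears in the loop, only a source image, an instruction, and two differentiable signals.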