SuperEdit: Rectifying and Facilitating Supervision for Instruction-Based Image Editing
May 5, 2025
Authors: Ming Li, Xin Gu, Fan Chen, Xiaoying Xing, Longyin Wen, Chen Chen, Sijie Zhu
cs.AI
Abstract
Due to the challenges of manually collecting accurate editing data, existing
datasets are typically constructed using various automated methods, leading to
noisy supervision signals caused by the mismatch between editing instructions
and original-edited image pairs. Recent efforts attempt to improve editing
models by generating higher-quality edited images, pre-training on
recognition tasks, or introducing vision-language models (VLMs), but they
fail to resolve this fundamental issue. In this paper, we offer a novel solution by
constructing more effective editing instructions for given image pairs. This
includes rectifying the editing instructions to better align with the
original-edited image pairs and using contrastive editing instructions to
further enhance their effectiveness. Specifically, we find that editing models
exhibit specific generation attributes at different inference steps,
independent of the text. Based on these prior attributes, we define a unified
guideline for VLMs to rectify editing instructions. However, there are some
challenging editing scenarios that cannot be resolved solely with rectified
instructions. To this end, we further construct contrastive supervision signals
with positive and negative instructions and introduce them into the model
training via a triplet loss, thereby further improving supervision
effectiveness. Our method does not require the VLM modules or pre-training
tasks used in previous work, offering a more direct and efficient way to
provide better supervision signals and a novel, simple, and effective
solution for instruction-based image editing. Results on multiple
benchmarks demonstrate that our method significantly outperforms existing
approaches. Compared with the previous SOTA SmartEdit, we achieve a 9.19%
improvement on the Real-Edit benchmark with 30x less training data and a 13x
smaller model size.
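The contrastive supervision described above can be sketched as a standard hinge-style triplet loss over embeddings: the edited image's feature acts as the anchor, the rectified instruction's embedding as the positive, and a negative instruction's embedding as the counterexample. The function name, distance choice, and margin value below are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge-style triplet loss: pull the anchor (e.g. an edited-image
    feature) toward the positive (rectified instruction embedding) and
    push it away from the negative (contrastive instruction embedding)."""
    d_pos = np.linalg.norm(anchor - positive)  # distance to the correct instruction
    d_neg = np.linalg.norm(anchor - negative)  # distance to the negative instruction
    return max(0.0, d_pos - d_neg + margin)

# Toy 2-D embeddings: the anchor lies near the positive and far from the
# negative, so the margin constraint is satisfied and the loss is zero.
anchor = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])
negative = np.array([-1.0, 0.0])
loss = triplet_loss(anchor, positive, negative)

# Swapping positive and negative violates the margin and yields a
# positive loss, which is the gradient signal that separates the pair.
bad_loss = triplet_loss(anchor, negative, positive)
```

In practice such a loss is added alongside the usual diffusion training objective, so the model is penalized whenever the edited result sits closer to the negative instruction than to the rectified one.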