SuperEdit: Rectifying and Facilitating Supervision for Instruction-Based Image Editing
May 5, 2025
Authors: Ming Li, Xin Gu, Fan Chen, Xiaoying Xing, Longyin Wen, Chen Chen, Sijie Zhu
cs.AI
Abstract
Due to the challenges of manually collecting accurate editing data, existing
datasets are typically constructed using various automated methods, leading to
noisy supervision signals caused by the mismatch between editing instructions
and original-edited image pairs. Recent efforts attempt to improve editing
models by generating higher-quality edited images, pre-training on
recognition tasks, or introducing vision-language models (VLMs), but they fail to
resolve this fundamental issue. In this paper, we offer a novel solution by
constructing more effective editing instructions for given image pairs. This
includes rectifying the editing instructions to better align with the
original-edited image pairs and using contrastive editing instructions to
further enhance their effectiveness. Specifically, we find that editing models
exhibit specific generation attributes at different inference steps,
independent of the text. Based on these prior attributes, we define a unified
guide for VLMs to rectify editing instructions. However, there are some
challenging editing scenarios that cannot be resolved solely with rectified
instructions. To this end, we further construct contrastive supervision signals
with positive and negative instructions and introduce them into the model
training using a triplet loss, thereby further enhancing the effectiveness of
the supervision. Our method does not require the VLM modules or pre-training
tasks used in previous work, offering a more direct and efficient way to
provide better supervision signals and a novel, simple, and
effective solution for instruction-based image editing. Results on multiple
benchmarks demonstrate that our method significantly outperforms existing
approaches. Compared with the previous SOTA SmartEdit, we achieve a 9.19%
improvement on the Real-Edit benchmark with 30x less training data and a 13x
smaller model.
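
As a rough illustration of the contrastive supervision the abstract describes, the sketch below applies a standard triplet loss to embeddings of positive and negative editing instructions. The embedding source, the use of cosine distance, and the margin value are assumptions for illustration only, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def instruction_triplet_loss(anchor: torch.Tensor,
                             positive: torch.Tensor,
                             negative: torch.Tensor,
                             margin: float = 0.2) -> torch.Tensor:
    """Hypothetical triplet loss over instruction embeddings.

    anchor:   embedding tied to the original-edited image pair
              (e.g. pooled features from the editing model)
    positive: embedding of the rectified, matching instruction
    negative: embedding of a contrastive, non-matching instruction
    """
    # Cosine distance makes the loss insensitive to embedding norm.
    d_pos = 1.0 - F.cosine_similarity(anchor, positive, dim=-1)
    d_neg = 1.0 - F.cosine_similarity(anchor, negative, dim=-1)
    # Standard triplet hinge: pull the positive instruction closer
    # than the negative one by at least `margin`.
    return F.relu(d_pos - d_neg + margin).mean()

# Example usage with random embeddings (batch of 4, dim 768):
if __name__ == "__main__":
    a, p, n = (torch.randn(4, 768) for _ in range(3))
    print(instruction_triplet_loss(a, p, n).item())
```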