
Instruct-CLIP: Improving Instruction-Guided Image Editing with Automated Data Refinement Using Contrastive Learning

March 24, 2025
Authors: Sherry X. Chen, Misha Sra, Pradeep Sen
cs.AI

Abstract

Although natural language instructions offer an intuitive way to guide automated image editing, deep-learning models often struggle to achieve high-quality results, largely due to challenges in creating large, high-quality training datasets. Previous work has typically relied on text-to-image (T2I) generative models to produce pairs of original and edited images that simulate the input/output of an instruction-guided image-editing model. However, these image pairs often fail to align with the specified edit instructions due to the limitations of T2I models, which negatively impacts models trained on such datasets. To address this, we present Instruct-CLIP, a self-supervised method that learns the semantic changes between original and edited images to refine and better align the instructions in existing datasets. Furthermore, we adapt Instruct-CLIP to handle noisy latent images and diffusion timesteps so that it can be used to train latent diffusion models (LDMs) [19] and efficiently enforce alignment between the edit instruction and the image changes in latent space at any step of the diffusion pipeline. We use Instruct-CLIP to correct the InstructPix2Pix dataset, obtaining over 120K refined samples, which we then use to fine-tune the InstructPix2Pix model, guided by our novel Instruct-CLIP-based loss function. The resulting model produces edits that are more aligned with the given instructions. Our code and dataset are available at https://github.com/SherryXTChen/Instruct-CLIP.git.
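The abstract does not spell out the loss, but the core idea — contrastively aligning an embedding of the image change (edited minus original) with an embedding of the edit instruction — can be illustrated with a CLIP-style symmetric InfoNCE objective. The sketch below is our own minimal illustration under that assumption, not the paper's actual formulation; all function names and the difference-based change embedding are hypothetical.

```python
import numpy as np

def change_embedding(orig_emb, edited_emb):
    """Hypothetical image-change embedding: the difference between
    the edited and original image embeddings."""
    return edited_emb - orig_emb

def info_nce_loss(change_emb, instr_emb, temperature=0.07):
    """Symmetric InfoNCE loss (CLIP-style) aligning image-change
    embeddings with edit-instruction embeddings. Matching pairs
    share the same row index in both arrays."""
    # L2-normalize both sets of embeddings so similarity is cosine
    a = change_emb / np.linalg.norm(change_emb, axis=1, keepdims=True)
    b = instr_emb / np.linalg.norm(instr_emb, axis=1, keepdims=True)
    logits = a @ b.T / temperature      # (N, N) similarity matrix
    idx = np.arange(len(a))             # positives lie on the diagonal

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    # average over both directions: change -> instruction and back
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

With perfectly aligned embeddings (e.g. matched orthonormal vectors) the loss is near zero; mismatching the pairs drives it up, which is the signal one could use both to flag misaligned dataset samples and as an auxiliary fine-tuning loss.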
