

Instruct-CLIP: Improving Instruction-Guided Image Editing with Automated Data Refinement Using Contrastive Learning

March 24, 2025
作者: Sherry X. Chen, Misha Sra, Pradeep Sen
cs.AI

Abstract

Although natural language instructions offer an intuitive way to guide automated image editing, deep-learning models often struggle to achieve high-quality results, largely due to challenges in creating large, high-quality training datasets. Previous work has typically relied on text-to-image (T2I) generative models to produce pairs of original and edited images that simulate the input/output of an instruction-guided image-editing model. However, these image pairs often fail to align with the specified edit instructions due to the limitations of T2I models, which negatively impacts models trained on such datasets. To address this, we present Instruct-CLIP, a self-supervised method that learns the semantic changes between original and edited images to refine and better align the instructions in existing datasets. Furthermore, we adapt Instruct-CLIP to handle noisy latent images and diffusion timesteps so that it can be used to train latent diffusion models (LDMs) [19] and efficiently enforce alignment between the edit instruction and the image changes in latent space at any step of the diffusion pipeline. We use Instruct-CLIP to correct the InstructPix2Pix dataset and get over 120K refined samples we then use to fine-tune their model, guided by our novel Instruct-CLIP-based loss function. The resulting model can produce edits that are more aligned with the given instructions. Our code and dataset are available at https://github.com/SherryXTChen/Instruct-CLIP.git.
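The core idea of aligning an edit instruction with the semantic change between the original and edited images can be illustrated with a CLIP-style symmetric contrastive (InfoNCE) loss. The sketch below is a minimal, hypothetical illustration of that general idea only: the function names, the use of a feature difference as the "image change" embedding, and the temperature value are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def normalize(x, axis=-1, eps=1e-8):
    """L2-normalize embeddings along the last axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def contrastive_alignment_loss(instr_emb, orig_img_emb, edit_img_emb,
                               temperature=0.07):
    """Symmetric InfoNCE loss between instruction embeddings and
    image-change embeddings (edited minus original image features).
    Matching (instruction, change) pairs share the same batch index."""
    change = normalize(edit_img_emb - orig_img_emb)   # (B, D) change directions
    instr = normalize(instr_emb)                      # (B, D) instruction embeddings
    logits = instr @ change.T / temperature           # (B, B) similarity matrix
    labels = np.arange(len(logits))                   # diagonal = positive pairs

    def cross_entropy(l):
        # numerically stable log-softmax over each row
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average the instruction->change and change->instruction directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Under this loss, well-aligned (instruction, image-change) pairs score lower than mismatched ones, which is the property a refinement method could exploit to flag and rewrite instructions that disagree with the actual edit.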

