Instruct-CLIP: Improving Instruction-Guided Image Editing with Automated Data Refinement Using Contrastive Learning

Abstract

Although natural language instructions offer an intuitive way to guide automated image editing, deep-learning models often struggle to achieve high-quality results, largely because large, high-quality training datasets are difficult to create. Previous work has typically relied on text-to-image (T2I) generative models to produce pairs of original and edited images that simulate the input/output of an instruction-guided image-editing model. However, these image pairs often fail to align with the specified edit instructions due to the limitations of T2I models, which negatively impacts models trained on such datasets. To address this, we present Instruct-CLIP, a self-supervised method that learns the semantic changes between original and edited images to refine and better align the instructions in existing datasets. Furthermore, we adapt Instruct-CLIP to handle noisy latent images and diffusion timesteps so that it can be used to train latent diffusion models (LDMs) [19] and efficiently enforce alignment between the edit instruction and the image changes in latent space at any step of the diffusion pipeline. We use Instruct-CLIP to correct the InstructPix2Pix dataset and obtain over 120K refined samples, which we then use to fine-tune their model, guided by our novel Instruct-CLIP-based loss function. The resulting model produces edits that are more closely aligned with the given instructions. Our code and dataset are available at https://github.com/SherryXTChen/Instruct-CLIP.git.
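The abstract does not state the training objective, so the following is only a minimal sketch of a generic CLIP-style symmetric contrastive loss, written under the assumption that each (original, edited) image pair has already been encoded into an "image change" embedding and each edit instruction into a text embedding. The function name `symmetric_contrastive_loss`, the `temperature` value, and the toy embeddings are hypothetical and are not taken from the paper or its repository.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(row)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum_exp for x in row]


def symmetric_contrastive_loss(change_embs, text_embs, temperature=0.07):
    """Generic CLIP-style loss (an assumption, not the paper's exact objective).

    change_embs[i] is a hypothetical embedding of the change between the i-th
    original and edited image; text_embs[i] is the embedding of the i-th edit
    instruction. Matching pairs share an index; all other pairs in the batch
    act as negatives.
    """
    changes = [l2_normalize(v) for v in change_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(changes)

    # Pairwise cosine similarities scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(changes[i], texts[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]

    # Cross-entropy in both directions: change -> text and text -> change.
    loss_change_to_text = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_text_to_change = -sum(log_softmax(columns[j])[j] for j in range(n)) / n

    return 0.5 * (loss_change_to_text + loss_text_to_change)
```

In this sketch, a batch whose instructions match their image changes yields a loss near zero, while a batch with swapped instructions yields a large loss. That behavior is what would let such a score flag poorly aligned instruction and image-pair samples.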


