Paint by Inpaint: Learning to Add Image Objects by Removing Them First

Abstract

Image editing has advanced significantly with the introduction of text-conditioned diffusion models. Despite this progress, seamlessly adding objects to images from textual instructions, without user-provided input masks, remains a challenge. We address this by leveraging the insight that removing objects (Inpaint) is significantly simpler than the inverse process of adding them (Paint), because segmentation mask datasets can be combined with inpainting models that fill in the masked regions. Building on this insight, we implement an extensive automated pipeline to curate a filtered, large-scale image dataset of pairs of images and their corresponding object-removed versions. Using these pairs, we train a diffusion model to invert the inpainting process, effectively adding objects to images. Unlike other editing datasets, ours features natural target images instead of synthetic ones; moreover, it maintains consistency between source and target by construction. Additionally, we use a large Vision-Language Model to provide detailed descriptions of the removed objects and a Large Language Model to convert these descriptions into diverse, natural-language instructions. We show that the trained model surpasses existing ones both qualitatively and quantitatively, and we release the large-scale dataset alongside the trained models for the community.
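The data-construction idea in the abstract (remove an object, then train on the reversed pair) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `AddObjectExample`, `mean_fill_inpaint`, and `build_example` are hypothetical, images are toy nested lists, the mean-fill function is only a stand-in for a real inpainting model, and the `"Add ..."` template is a stand-in for the Vision-Language Model and Large Language Model steps. It is not the authors' pipeline.

```python
from dataclasses import dataclass
from typing import Callable, List

Image = List[List[float]]  # toy grayscale image: rows of pixel values
Mask = List[List[bool]]    # True where the object to remove lies


@dataclass
class AddObjectExample:
    """One training pair for object addition (hypothetical container)."""
    source: Image      # object-removed image, used as the model input
    target: Image      # original natural image, used as the model output
    instruction: str   # text instruction describing the object to add


def mean_fill_inpaint(image: Image, mask: Mask) -> Image:
    """Toy stand-in for an inpainting model (assumption, not a real model):
    fills masked pixels with the mean of the unmasked pixels."""
    kept = [p for row, mrow in zip(image, mask)
            for p, m in zip(row, mrow) if not m]
    fill = sum(kept) / len(kept) if kept else 0.0
    return [[fill if m else p for p, m in zip(row, mrow)]
            for row, mrow in zip(image, mask)]


def build_example(
    image: Image,
    mask: Mask,
    description: str,
    inpaint: Callable[[Image, Mask], Image] = mean_fill_inpaint,
) -> AddObjectExample:
    """Remove the masked object, then reverse the pair's direction:
    the inpainted image is the source and the original is the target."""
    removed = inpaint(image, mask)
    # Template instruction; a placeholder for the VLM + LLM steps.
    return AddObjectExample(source=removed, target=image,
                            instruction=f"Add {description}")


# Usage: a 3x3 image whose bright center pixel plays the role of the object.
image = [[1.0, 1.0, 1.0], [1.0, 9.0, 1.0], [1.0, 1.0, 1.0]]
mask = [[False, False, False], [False, True, False], [False, False, False]]
example = build_example(image, mask, "a bright dot in the center")
```

Because the target is the untouched original image, the pair is consistent everywhere outside the mask by construction, which is the property the abstract contrasts with synthetic-target editing datasets.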
