Paint by Inpaint: Learning to Add Image Objects by Removing Them First
April 28, 2024
作者: Navve Wasserman, Noam Rotstein, Roy Ganz, Ron Kimmel
cs.AI
Abstract
Image editing has advanced significantly with the introduction of
text-conditioned diffusion models. Despite this progress, seamlessly adding
objects to images based on textual instructions without requiring user-provided
input masks remains a challenge. We address this by leveraging the insight that
removing objects (Inpaint) is significantly simpler than its inverse process of
adding them (Paint), attributed to the utilization of segmentation mask
datasets alongside inpainting models that inpaint within these masks.
Capitalizing on this realization, by implementing an automated and extensive
pipeline, we curate a filtered large-scale image dataset containing pairs of
images and their corresponding object-removed versions. Using these pairs, we
train a diffusion model to inverse the inpainting process, effectively adding
objects into images. Unlike other editing datasets, ours features natural
target images instead of synthetic ones; moreover, it maintains consistency
between source and target by construction. Additionally, we utilize a large
Vision-Language Model to provide detailed descriptions of the removed objects
and a Large Language Model to convert these descriptions into diverse,
natural-language instructions. We show that the trained model surpasses
existing ones both qualitatively and quantitatively, and release the
large-scale dataset alongside the trained models for the community.
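The curation idea behind the paper can be illustrated with a minimal toy sketch (all function names here are hypothetical, not the authors' code): object removal is automated using a segmentation mask and an inpainting step, and each (object-removed, original) pair becomes one supervised example for the inverse "add an object" direction. A real pipeline would use a segmentation dataset and a diffusion inpainting model in place of the fill-value stand-in below.

```python
# Toy sketch of Paint-by-Inpaint data curation: removing an object with a
# segmentation mask is easy; the resulting pair trains the harder inverse
# task of adding it back from a text instruction.

def remove_object(image, mask, fill=0):
    """Stand-in for a real inpainting model: blank out the masked object.

    image: 2D list of pixel values; mask: 2D list of 0/1 (1 = object).
    """
    return [
        [fill if mask[r][c] else image[r][c] for c in range(len(image[0]))]
        for r in range(len(image))
    ]

def make_training_pair(image, mask, instruction):
    """One example: the source lacks the object, the target (the natural
    image) contains it, and the instruction describes the addition."""
    source = remove_object(image, mask)
    return {"source": source, "target": image, "instruction": instruction}

image = [[5, 5, 5],
         [5, 9, 5],   # the "object" is the 9 in the centre
         [5, 5, 5]]
mask = [[0, 0, 0],
        [0, 1, 0],
        [0, 0, 0]]

pair = make_training_pair(image, mask, "add a bright dot in the centre")
```

Note how consistency between source and target holds by construction: only the masked pixels differ, which is the property the abstract highlights over synthetic-target editing datasets.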