Paint by Inpaint: Learning to Add Image Objects by Removing Them First
April 28, 2024
Authors: Navve Wasserman, Noam Rotstein, Roy Ganz, Ron Kimmel
cs.AI
Abstract
Image editing has advanced significantly with the introduction of
text-conditioned diffusion models. Despite this progress, seamlessly adding
objects to images based on textual instructions without requiring user-provided
input masks remains a challenge. We address this by leveraging the insight that
removing objects (Inpaint) is significantly simpler than its inverse, adding
them (Paint), owing to the availability of segmentation mask datasets alongside
inpainting models that fill within those masks. Capitalizing on this
realization, we implement an automated, extensive pipeline to curate a
filtered, large-scale dataset of image pairs, each consisting of an image and
its corresponding object-removed version. Using these pairs, we train a
diffusion model to invert the inpainting process, effectively adding
objects into images. Unlike other editing datasets, ours features natural
target images instead of synthetic ones; moreover, it maintains consistency
between source and target by construction. Additionally, we utilize a large
Vision-Language Model to provide detailed descriptions of the removed objects
and a Large Language Model to convert these descriptions into diverse,
natural-language instructions. We show that the trained model surpasses
existing ones both qualitatively and quantitatively, and release the
large-scale dataset alongside the trained models for the community.
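The data-construction idea described in the abstract can be sketched in a few lines: given a natural image and a segmentation mask, an inpainting model removes the object to produce the *source*, while the original image serves as the *target*, so an editing model trained on source → target learns to add the object. The sketch below is a minimal, hypothetical illustration of that pairing logic; `naive_inpaint`, which fills the mask with the mean background colour, is a stand-in for a real diffusion inpainting model, not the paper's actual pipeline.

```python
import numpy as np

def make_training_pair(image, object_mask, inpaint_fn):
    """Build one (source, target) editing pair.

    The target is the original image (object present); the source is
    the object-removed version produced by inpainting inside the mask.
    The editing model is then trained to map source -> target, i.e.
    to ADD the object back.
    """
    source = inpaint_fn(image, object_mask)  # object removed
    target = image                           # natural image, untouched
    return source, target

def naive_inpaint(image, mask):
    # Hypothetical stand-in for a real inpainting model: fill the
    # masked region with the mean colour of the unmasked pixels.
    out = image.copy()
    background = image[~mask]        # (N, 3) unmasked pixels
    out[mask] = background.mean(axis=0)
    return out

# Toy example: 8x8 RGB image with a bright "object" patch.
rng = np.random.default_rng(0)
img = rng.uniform(0.0, 0.2, size=(8, 8, 3))
img[2:5, 2:5] = 0.9                  # the object
mask = np.zeros((8, 8), dtype=bool)
mask[2:5, 2:5] = True

src, tgt = make_training_pair(img, mask, naive_inpaint)
assert src[mask].max() < 0.5         # object gone in the source
assert (tgt == img).all()            # target is the natural image
```

By construction, source and target differ only inside the mask, which is why the abstract can claim consistency between source and target images without any post-hoc alignment.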