Paint by Inpaint: Learning to Add Image Objects by Removing Them First

Abstract

Image editing has advanced significantly with the introduction of text-conditioned diffusion models. Despite this progress, seamlessly adding objects to images based on textual instructions, without requiring user-provided input masks, remains a challenge. We address this by leveraging the insight that removing objects (Inpaint) is significantly simpler than the inverse process of adding them (Paint), because segmentation mask datasets can be combined with inpainting models that fill in the regions inside those masks. Building on this insight, we implement an extensive automated pipeline to curate a filtered, large-scale image dataset containing pairs of images and their corresponding object-removed versions. Using these pairs, we train a diffusion model to invert the inpainting process, effectively adding objects to images. Unlike other editing datasets, ours features natural target images rather than synthetic ones; moreover, it maintains consistency between source and target by construction. Additionally, we use a large Vision-Language Model to provide detailed descriptions of the removed objects and a Large Language Model to convert these descriptions into diverse natural-language instructions. We show that the trained model surpasses existing ones both qualitatively and quantitatively, and we release the large-scale dataset alongside the trained models for the community.
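The pair-construction idea can be illustrated with a minimal sketch. This is not the released pipeline: the names `EditPair`, `mean_fill_inpaint`, and `build_pair` are hypothetical, the inpainter is a toy stand-in that fills the masked region with the mean background color rather than a learned inpainting model, and the templated instruction stands in for the Vision-Language Model and Large Language Model steps. It only shows the direction reversal: the object-removed image becomes the source and the untouched natural image becomes the target.

```python
from dataclasses import dataclass
from typing import Callable

import numpy as np


@dataclass
class EditPair:
    source: np.ndarray  # object-removed image (input to the editing model)
    target: np.ndarray  # original natural image (what the model should produce)
    instruction: str    # text instruction describing the object to add


def mean_fill_inpaint(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Toy stand-in for an inpainting model (an assumption for illustration).

    Fills the masked region with the mean color of the unmasked pixels.
    A real pipeline would call a learned inpainting model here.
    """
    result = image.astype(np.float64).copy()
    background_mean = result[~mask].mean(axis=0)  # one mean value per channel
    result[mask] = background_mean
    return result


def build_pair(
    image: np.ndarray,
    mask: np.ndarray,
    object_name: str,
    inpaint: Callable[[np.ndarray, np.ndarray], np.ndarray] = mean_fill_inpaint,
) -> EditPair:
    """Remove the masked object, then reverse the direction of the edit."""
    removed = inpaint(image, mask)
    # Placeholder instruction; the abstract describes generating these with
    # a VLM and an LLM instead of a fixed template.
    instruction = f"Add a {object_name}"
    return EditPair(
        source=removed,
        target=image.astype(np.float64),
        instruction=instruction,
    )


# Usage: a 4x4 black image with a white 2x2 "object" in the center.
image = np.zeros((4, 4, 3))
image[1:3, 1:3] = 255.0
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
pair = build_pair(image, mask, "white square")
```

Because the target is the original photograph, every pixel outside the mask is identical in source and target, which is the sense in which consistency holds by construction.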
