Unsupervised Semantic Correspondence Using Stable Diffusion

Abstract

Text-to-image diffusion models can now generate images that are often indistinguishable from real ones. To generate such images, these models must understand the semantics of the objects they are asked to generate. In this work we show that, without any training, one can leverage this semantic knowledge within diffusion models to find semantic correspondences, i.e., locations in multiple images that have the same semantic meaning. Specifically, given an image, we optimize the prompt embeddings of these models for maximum attention on the regions of interest. These optimized embeddings capture semantic information about the location, which can then be transferred to another image. By doing so, we obtain results on par with the strongly supervised state of the art on the PF-Willow dataset, and we significantly outperform any existing weakly supervised or unsupervised method on the PF-Willow, CUB-200 and SPair-71k datasets (a 20.9% relative improvement on SPair-71k).
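
The optimize-then-transfer idea can be illustrated with a toy sketch. This is not the authors' implementation: random unit vectors stand in for the diffusion model's internal image features, attention is assumed to be a plain softmax over locations, and every name here (`attention`, `optimize_embedding`, `transfer`) is hypothetical. The sketch fits an embedding so that attention peaks at a query location in a source image, then reuses that embedding on a target image and reads off the most attended location.

```python
import math
import random


def attention(feats, emb):
    """Softmax over locations of the dot product between features and embedding."""
    logits = [sum(f * e for f, e in zip(row, emb)) for row in feats]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(z - peak) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]


def optimize_embedding(feats, query, steps=300, lr=0.5):
    """Gradient descent on -log(attention[query]), starting from a zero embedding."""
    dim = len(feats[0])
    emb = [0.0] * dim
    for _ in range(steps):
        attn = attention(feats, emb)  # fixed for the whole step
        for d in range(dim):
            # Gradient of -log(attn[query]) with respect to emb[d].
            grad = sum(a * row[d] for a, row in zip(attn, feats)) - feats[query][d]
            emb[d] -= lr * grad
    return emb


def transfer(emb, feats):
    """Return the location in `feats` that the embedding attends to most."""
    attn = attention(feats, emb)
    return max(range(len(attn)), key=attn.__getitem__)


def unit(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


# Toy data (assumption): the target image holds the source features at
# shuffled locations, plus a little noise.
rng = random.Random(0)
num_locations, dim = 64, 16
source = [unit([rng.gauss(0, 1) for _ in range(dim)]) for _ in range(num_locations)]
perm = list(range(num_locations))
rng.shuffle(perm)
target = [[v + rng.gauss(0, 0.01) for v in source[j]] for j in perm]

query = 27                                     # location of interest in the source
embedding = optimize_embedding(source, query)  # fit on the source image only
match = transfer(embedding, target)            # reuse on the target image
expected = perm.index(query)                   # ground truth for the toy data
print(match == expected)
```

In the method the abstract describes, the features and attention maps would come from the diffusion model itself and the optimized quantity is the prompt embedding; the toy only mirrors the two-stage structure of fitting on one image and reading out on another.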
