Unsupervised Semantic Correspondence Using Stable Diffusion

Abstract

Text-to-image diffusion models are now capable of generating images that are often indistinguishable from real images. To generate such images, these models must understand the semantics of the objects they are asked to generate. In this work we show that, without any training, one can leverage this semantic knowledge within diffusion models to find semantic correspondences: locations in multiple images that have the same semantic meaning. Specifically, given an image, we optimize the prompt embeddings of these models for maximum attention on the regions of interest. These optimized embeddings capture semantic information about the location, which can then be transferred to another image. By doing so, we obtain results on par with the strongly supervised state of the art on the PF-Willow dataset, and we significantly outperform any existing weakly supervised or unsupervised method on the PF-Willow, CUB-200 and SPair-71k datasets (a 20.9% relative improvement on SPair-71k).
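The optimize-then-transfer idea described in the abstract can be illustrated with a small toy sketch. Everything here is an assumption made for illustration and is not the paper's implementation: random vectors stand in for the diffusion model's image features, a single softmax over locations stands in for cross-attention, the log-attention objective is a guess at a reasonable loss (the abstract does not specify one), and the helper names `attention_map`, `optimize_embedding`, `find_correspondence` and `make_toy_pair` are hypothetical.

```python
import numpy as np


def attention_map(features, embedding):
    """Softmax over locations of the scaled dot product with one embedding."""
    scores = features @ embedding / np.sqrt(features.shape[1])
    scores = scores - scores.max()  # subtract the max for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()


def optimize_embedding(features, query_index, steps=200, lr=0.5, seed=0):
    """Gradient ascent on the log-attention at query_index (toy objective)."""
    rng = np.random.default_rng(seed)
    dim = features.shape[1]
    embedding = 0.01 * rng.standard_normal(dim)
    for _ in range(steps):
        attn = attention_map(features, embedding)
        # Gradient of log softmax at query_index with respect to the embedding.
        grad = (features[query_index] - attn @ features) / np.sqrt(dim)
        embedding = embedding + lr * grad
    return embedding


def find_correspondence(target_features, embedding):
    """Transfer step: the location in the target that attends most strongly."""
    return int(np.argmax(attention_map(target_features, embedding)))


def make_toy_pair(num_locations=64, dim=32, noise=0.05, seed=0):
    """Toy 'images': the target is a shuffled, slightly noisy copy of the source."""
    rng = np.random.default_rng(seed)
    source = rng.standard_normal((num_locations, dim))
    permutation = rng.permutation(num_locations)
    target = np.empty_like(source)
    # Location i in the source reappears at permutation[i] in the target.
    target[permutation] = source + noise * rng.standard_normal(source.shape)
    return source, target, permutation


if __name__ == "__main__":
    source, target, permutation = make_toy_pair()
    query = 5
    embedding = optimize_embedding(source, query)
    print("predicted:", find_correspondence(target, embedding))
    print("expected: ", int(permutation[query]))
```

In this toy setting the embedding is fitted on the source features only and is then reused unchanged on the target, which mirrors the transfer step the abstract describes; in a real system the features and attention would come from the diffusion model rather than from random vectors.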

