Unsupervised Semantic Correspondence Using Stable Diffusion

Abstract

Text-to-image diffusion models can now generate images that are often indistinguishable from real ones. To generate such images, these models must understand the semantics of the objects they are asked to generate. In this work, we show that, without any training, one can leverage this semantic knowledge within diffusion models to find semantic correspondences, that is, locations in multiple images that have the same semantic meaning. Specifically, given an image, we optimize the prompt embeddings of these models for maximum attention on the regions of interest. These optimized embeddings capture semantic information about the location, which can then be transferred to another image. By doing so, we obtain results on par with the strongly supervised state of the art on the PF-Willow dataset and significantly outperform any existing weakly supervised or unsupervised method on the PF-Willow, CUB-200, and SPair-71k datasets (by 20.9% relative on SPair-71k).
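The optimize-then-transfer idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: no diffusion model is involved, the hand-written feature vectors stand in for image features from Stable Diffusion, attention is reduced to a softmax over feature-embedding dot products, and the objective (gradient ascent on the log attention at the query location) is an assumption chosen for simplicity. All function names (`attention`, `optimize_embedding`, `find_correspondence`) are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def attention(features, embedding):
    """Toy attention map: softmax over positions of feature-embedding dot products."""
    scores = [sum(f * e for f, e in zip(feat, embedding)) for feat in features]
    return softmax(scores)


def optimize_embedding(features, query_idx, steps=200, lr=0.5):
    """Gradient ascent on log attention at query_idx (an assumed objective).

    The gradient of log attn[query_idx] with respect to the embedding is
    features[query_idx] minus the attention-weighted mean of all features.
    """
    dim = len(features[0])
    embedding = [0.0] * dim
    for _ in range(steps):
        attn = attention(features, embedding)
        for d in range(dim):
            expected = sum(a * feat[d] for a, feat in zip(attn, features))
            embedding[d] += lr * (features[query_idx][d] - expected)
    return embedding


def find_correspondence(features, embedding):
    """Return the position with the highest attention for the given embedding."""
    attn = attention(features, embedding)
    return max(range(len(attn)), key=attn.__getitem__)


# Toy "source image": four positions with 3-dimensional features (made-up values).
source = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
# Toy "target image": similar content at different positions, slightly perturbed.
target = [[0.1, 0.9, 0.0], [0.0, 0.1, 0.9], [0.9, 0.9, 0.1], [0.9, 0.0, 0.1]]

query_idx = 2  # location of interest in the source
embedding = optimize_embedding(source, query_idx)
match_idx = find_correspondence(target, embedding)
print("source query:", query_idx, "-> target match:", match_idx)
```

In this sketch the embedding is fitted on the source features only and then reused unchanged on the target features, which mirrors the transfer step the abstract describes. A real pipeline would read attention maps from the diffusion model instead of computing dot products on fixed vectors.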
