Unsupervised Semantic Correspondence Using Stable Diffusion

Abstract

Text-to-image diffusion models are now capable of generating images that are often indistinguishable from real images. To generate such images, these models must understand the semantics of the objects they are asked to generate. In this work we show that, without any training, one can leverage this semantic knowledge within diffusion models to find semantic correspondences: locations in multiple images that have the same semantic meaning. Specifically, given an image, we optimize the prompt embeddings of these models for maximum attention on the regions of interest. These optimized embeddings capture semantic information about the location, which can then be transferred to another image. By doing so we obtain results on par with the strongly supervised state of the art on the PF-Willow dataset and significantly outperform any existing weakly supervised or unsupervised method on the PF-Willow, CUB-200, and SPair-71k datasets (a 20.9% relative improvement on SPair-71k).
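The optimize-then-transfer idea described above can be illustrated with a small toy sketch. This is not the authors' implementation: random feature vectors stand in for the diffusion model's internal image features, a single-token softmax stands in for its cross-attention, and the function names (`attention`, `optimize_embedding`, `find_correspondence`) are hypothetical. Under those assumptions, it optimizes an embedding so that attention peaks at a chosen source location, then reads off the attention maximum in a second "image".

```python
import numpy as np


def attention(features, embedding):
    """Softmax attention of one token embedding over all image locations."""
    logits = features @ embedding / np.sqrt(features.shape[1])
    logits -= logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()


def optimize_embedding(features, query_index, steps=200, lr=1.0, seed=0):
    """Gradient ascent on log-attention at the query location (toy stand-in)."""
    rng = np.random.default_rng(seed)
    dim = features.shape[1]
    embedding = 0.01 * rng.standard_normal(dim)
    for _ in range(steps):
        attn = attention(features, embedding)
        # d/d(embedding) of log attn[query]:
        # (query feature - attention-weighted mean feature) / sqrt(dim)
        grad = (features[query_index] - attn @ features) / np.sqrt(dim)
        embedding += lr * grad
    return embedding


def find_correspondence(source_features, target_features, query_index):
    """Optimize on the source, then take the attention argmax on the target."""
    embedding = optimize_embedding(source_features, query_index)
    return int(np.argmax(attention(target_features, embedding)))


# Toy data (assumption): the target is a shuffled, slightly noisy copy of the
# source, so the true correspondence is known from the permutation.
rng = np.random.default_rng(42)
num_locations, dim = 64, 64  # e.g. an 8x8 grid of 64-dimensional features
source = rng.standard_normal((num_locations, dim))
perm = rng.permutation(num_locations)
target = source[perm] + 0.05 * rng.standard_normal((num_locations, dim))
inverse_perm = np.argsort(perm)  # source location q sits at target[inverse_perm[q]]

query = 21
match = find_correspondence(source, target, query)
print(f"source location {query} -> target location {match}")
```

In the real setting the features and attention maps would come from the diffusion model itself, and the optimized quantity would be its prompt embedding; the sketch only shows why an embedding tuned to attend to one location in a source image can be reused to locate the matching region elsewhere.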
