Emergent Correspondence from Image Diffusion
June 6, 2023
Authors: Luming Tang, Menglin Jia, Qianqian Wang, Cheng Perng Phoo, Bharath Hariharan
cs.AI
Abstract
Finding correspondences between images is a fundamental problem in computer
vision. In this paper, we show that correspondence emerges in image diffusion
models without any explicit supervision. We propose a simple strategy to
extract this implicit knowledge out of diffusion networks as image features,
namely DIffusion FeaTures (DIFT), and use them to establish correspondences
between real images. Without any additional fine-tuning or supervision on the
task-specific data or annotations, DIFT is able to outperform both
weakly-supervised methods and competitive off-the-shelf features in identifying
semantic, geometric, and temporal correspondences. Particularly for semantic
correspondence, DIFT from Stable Diffusion is able to outperform DINO and
OpenCLIP by 19 and 14 accuracy points respectively on the challenging SPair-71k
benchmark. It even outperforms the state-of-the-art supervised methods on 9 out
of 18 categories while remaining on par for the overall performance. Project
page: https://diffusionfeatures.github.io
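The matching step the abstract describes, establishing correspondences between real images from extracted features, amounts to a nearest-neighbor search under cosine similarity over dense feature maps. The sketch below illustrates only that step; the extraction of DIFT from the diffusion U-Net is omitted, and `match_point` is a hypothetical helper, not part of the paper's released code:

```python
import numpy as np

def match_point(feat_a, feat_b, y, x):
    """Find the location in feat_b whose feature vector has the highest
    cosine similarity to the feature of feat_a at (y, x).

    feat_a, feat_b: dense feature maps of shape (C, H, W), e.g. features
    extracted from intermediate layers of a diffusion network.
    Returns the (row, col) of the best match in feat_b.
    """
    C, H, W = feat_b.shape
    # L2-normalize the query feature so the dot product is cosine similarity.
    query = feat_a[:, y, x]
    query = query / (np.linalg.norm(query) + 1e-8)
    # Flatten candidate features to (C, H*W) and normalize each column.
    cand = feat_b.reshape(C, -1)
    cand = cand / (np.linalg.norm(cand, axis=0, keepdims=True) + 1e-8)
    # Cosine similarity of the query against every spatial location.
    sims = query @ cand
    idx = int(np.argmax(sims))
    return idx // W, idx % W
```

In practice the feature maps are bilinearly upsampled to image resolution before matching, so each pixel in the source image can be propagated to its nearest neighbor in the target.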