Emergent Correspondence from Image Diffusion

Abstract

Finding correspondences between images is a fundamental problem in computer vision. In this paper, we show that correspondence emerges in image diffusion models without any explicit supervision. We propose a simple strategy to extract this implicit knowledge from diffusion networks as image features, namely DIffusion FeaTures (DIFT), and use them to establish correspondences between real images. Without any additional fine-tuning or supervision on task-specific data or annotations, DIFT outperforms both weakly-supervised methods and competitive off-the-shelf features in identifying semantic, geometric, and temporal correspondences. For semantic correspondence in particular, DIFT from Stable Diffusion outperforms DINO and OpenCLIP by 19 and 14 accuracy points, respectively, on the challenging SPair-71k benchmark. It even outperforms state-of-the-art supervised methods on 9 out of 18 categories while remaining on par in overall performance. Project page: https://diffusionfeatures.github.io
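The abstract does not spell out how per-pixel features are turned into correspondences. As a minimal sketch, assuming dense feature maps of shape (C, H, W) have already been extracted for both images, one common choice is a nearest-neighbor lookup under cosine similarity. The function name `match_point` and the array layout are illustrative assumptions, not part of DIFT's released code.

```python
import numpy as np

def match_point(feat_src, feat_tgt, query_yx):
    """Find the target location whose feature best matches a source location.

    feat_src, feat_tgt: arrays of shape (C, H, W), assumed to be dense
        per-pixel features (hypothetical layout, chosen for illustration).
    query_yx: (row, col) index into the source feature map.
    Returns the (row, col) in the target map with highest cosine similarity.
    """
    c, h, w = feat_tgt.shape
    y, x = query_yx

    # Unit-normalize the query feature vector.
    q = feat_src[:, y, x]
    q = q / (np.linalg.norm(q) + 1e-8)

    # Flatten the target map to (C, H*W) and unit-normalize each column.
    t = feat_tgt.reshape(c, -1)
    t = t / (np.linalg.norm(t, axis=0, keepdims=True) + 1e-8)

    # Cosine similarity between the query and every target location.
    sim = q @ t

    # Convert the flat index of the best match back to (row, col).
    return divmod(int(np.argmax(sim)), w)
```

Because both vectors are unit-normalized, the dot product equals cosine similarity, so the match depends on feature direction rather than magnitude. Any feature extractor could feed this lookup; the comparison in the abstract concerns which features make such matches accurate.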
