Diffusion Models for Zero-Shot Open-Vocabulary Segmentation

Abstract

The variety of objects in the real world is nearly unlimited and is thus impossible to capture using models trained on a fixed set of categories. As a result, open-vocabulary methods have attracted the interest of the community in recent years. This paper proposes a new method for zero-shot open-vocabulary segmentation. Prior work largely relies on contrastive training using image-text pairs, leveraging grouping mechanisms to learn image features that are both aligned with language and well-localised. This, however, can introduce ambiguity, as the visual appearance of images with similar captions often varies. Instead, we leverage the generative properties of large-scale text-to-image diffusion models to sample a set of support images for a given textual category. This provides a distribution of appearances for a given text, circumventing the ambiguity problem. We further propose a mechanism that considers the contextual background of the sampled images to better localise objects and segment the background directly. We show that our method can be used to ground several existing pre-trained self-supervised feature extractors in natural language and to provide explainable predictions by mapping back to regions in the support set. Our proposal is training-free, relying on pre-trained components only, yet shows strong performance on a range of open-vocabulary segmentation benchmarks, obtaining a lead of more than 10% on the Pascal VOC benchmark.
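As a rough illustration of the idea of classifying image regions against a support set, the following is a minimal sketch, not the method of the paper. It assumes that each category (including an explicit background category) is summarised by the mean of its support features and that each location of a target feature map is assigned to the prototype with the highest cosine similarity. The function names, the mean-prototype choice, and the toy features are all hypothetical; in practice the features would come from a pre-trained self-supervised extractor applied to diffusion-sampled support images.

```python
import numpy as np


def l2_normalise(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def build_prototypes(support_feats):
    """Average each category's support features into one unit-length prototype.

    support_feats maps a category name to an (N, D) array. These arrays are
    hypothetical stand-ins for features of sampled support images.
    """
    names = list(support_feats)
    protos = np.stack([l2_normalise(support_feats[n]).mean(axis=0) for n in names])
    return names, l2_normalise(protos)


def segment(feature_map, protos):
    """Assign each location of an (H, W, D) feature map to its closest prototype."""
    sims = l2_normalise(feature_map) @ protos.T  # (H, W, K) cosine similarities
    return sims.argmax(axis=-1)                  # (H, W) prototype indices


# Toy data (assumption): each category clusters around its own basis vector.
rng = np.random.default_rng(0)
dim = 16
basis = np.eye(dim)
centres = {"cat": basis[0], "dog": basis[1], "background": basis[2]}
support = {n: c + 0.1 * rng.normal(size=(8, dim)) for n, c in centres.items()}
names, protos = build_prototypes(support)

# Target "image": left half cat-like features, right half background-like.
feature_map = np.empty((4, 6, dim))
feature_map[:, :3] = centres["cat"] + 0.1 * rng.normal(size=(4, 3, dim))
feature_map[:, 3:] = centres["background"] + 0.1 * rng.normal(size=(4, 3, dim))
labels = segment(feature_map, protos)
```

Treating the background as its own prototype is one simple way to segment it directly rather than by thresholding; how the background is actually modelled in the paper is not specified in the abstract.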
