Diffusion Models for Zero-Shot Open-Vocabulary Segmentation
June 15, 2023
Authors: Laurynas Karazija, Iro Laina, Andrea Vedaldi, Christian Rupprecht
cs.AI
Abstract
The variety of objects in the real world is nearly unlimited and is thus
impossible to capture using models trained on a fixed set of categories. As a
result, in recent years, open-vocabulary methods have attracted the interest of
the community. This paper proposes a new method for zero-shot open-vocabulary
segmentation. Prior work largely relies on contrastive training using
image-text pairs, leveraging grouping mechanisms to learn image features that
are both aligned with language and well-localised. This, however, can introduce
ambiguity, as the visual appearance of images with similar captions often
varies. Instead, we leverage the generative properties of large-scale
text-to-image diffusion models to sample a set of support images for a given
textual category. This provides a distribution of appearances for a given
text, circumventing the ambiguity problem. We further propose a mechanism that
considers the contextual background of the sampled images to better localise
objects and segment the background directly. We show that our method can be
used to ground several existing pre-trained self-supervised feature extractors
in natural language and provide explainable predictions by mapping back to
regions in the support set. Our proposal is training-free, relying on
pre-trained components only, yet shows strong performance on a range of
open-vocabulary segmentation benchmarks, obtaining a lead of more than 10% on
the Pascal VOC benchmark.
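The core idea — labelling each query-image region by its most similar region in a diffusion-sampled support set — can be sketched as a nearest-neighbour assignment over features. The sketch below is a minimal illustration, not the authors' implementation: it assumes the support images have already been sampled from a text-to-image diffusion model and encoded (along with the query image) by a pre-trained self-supervised feature extractor; the function name and input shapes are hypothetical.

```python
import numpy as np

def segment_by_support(query_feats, support_feats):
    """Assign each query patch the class of its most similar support region.

    query_feats:   (P, D) array of patch features from the query image
                   (hypothetical output of a pre-trained feature extractor).
    support_feats: dict mapping class name -> (N_c, D) array of region
                   features pooled from diffusion-sampled support images.
    Returns a (P,) array of class names and a (P,) array of the matching
    cosine similarities; the argmax index also points back to a concrete
    support region, which is what makes the prediction explainable.
    """
    def normalise(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    q = normalise(query_feats)                      # (P, D), unit-norm rows
    classes = list(support_feats)
    # For each class, the best cosine similarity of every patch to any
    # support region of that class; stacked into a (C, P) score matrix.
    scores = np.stack(
        [(normalise(support_feats[c]) @ q.T).max(axis=0) for c in classes]
    )
    labels = np.array(classes)[scores.argmax(axis=0)]
    return labels, scores.max(axis=0)
```

A background class can be handled the same way by adding a "background" entry to `support_feats`, built from the contextual (non-object) regions of the sampled support images, as the abstract's background mechanism suggests.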