Diffusion Models for Zero-Shot Open-Vocabulary Segmentation
June 15, 2023
Authors: Laurynas Karazija, Iro Laina, Andrea Vedaldi, Christian Rupprecht
cs.AI
Abstract
The variety of objects in the real world is nearly unlimited and is thus
impossible to capture using models trained on a fixed set of categories. As a
result, in recent years, open-vocabulary methods have attracted the interest of
the community. This paper proposes a new method for zero-shot open-vocabulary
segmentation. Prior work largely relies on contrastive training using
image-text pairs, leveraging grouping mechanisms to learn image features that
are both aligned with language and well-localised. This, however, can introduce
ambiguity, as the visual appearance of images with similar captions often
varies. Instead, we leverage the generative properties of large-scale
text-to-image diffusion models to sample a set of support images for a given
textual category. This provides a distribution of appearances for a given text,
circumventing the ambiguity problem. We further propose a mechanism that
considers the contextual background of the sampled images to better localise
objects and segment the background directly. We show that our method can be
used to ground several existing pre-trained self-supervised feature extractors
in natural language and provide explainable predictions by mapping back to
regions in the support set. Our proposal is training-free, relying on
pre-trained components only, yet it shows strong performance on a range of
open-vocabulary segmentation benchmarks, obtaining a lead of more than 10% on
the Pascal VOC benchmark.
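The mapping from query image regions back to the diffusion-sampled support set can be sketched as a nearest-neighbour assignment in feature space. The sketch below is a minimal, hedged illustration under assumptions not spelled out in the abstract: it presumes per-patch features from some frozen self-supervised extractor and a pre-computed bank of support features per class name; the function and variable names are hypothetical, not the authors' API.

```python
import numpy as np

def segment_by_support(query_feats, support_banks):
    """Label each query patch by its nearest support feature.

    query_feats:   (P, D) array of per-patch features from a frozen,
                   self-supervised extractor (assumed, e.g. DINO-style).
    support_banks: dict mapping class name -> (N_c, D) array of features
                   drawn from diffusion-sampled support images (assumed).
    Returns per-patch class labels plus, for explainability, the index
    of the support feature each patch mapped back to.
    """
    # L2-normalise so dot products become cosine similarities.
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)

    classes, best_sims, nn_idx = [], [], []
    for name, bank in support_banks.items():
        b = bank / np.linalg.norm(bank, axis=1, keepdims=True)
        s = q @ b.T                        # (P, N_c) cosine similarities
        best_sims.append(s.max(axis=1))    # best support match per patch
        nn_idx.append(s.argmax(axis=1))    # which support feature matched
        classes.append(name)

    best_sims = np.stack(best_sims)        # (C, P)
    winner = best_sims.argmax(axis=0)      # winning class per patch
    labels = [classes[c] for c in winner]
    matches = [int(nn_idx[c][p]) for p, c in enumerate(winner)]
    return labels, matches
```

Because a "background" entry in `support_banks` is treated like any other class, this framing also accommodates the abstract's idea of segmenting the background directly rather than by thresholding foreground scores.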