CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor

December 12, 2023
Authors: Shuyang Sun, Runjia Li, Philip Torr, Xiuye Gu, Siyang Li
cs.AI

Abstract

Existing open-vocabulary image segmentation methods require a fine-tuning step on mask annotations and/or image-text datasets. Mask labels are labor-intensive to produce, which limits the number of categories in segmentation datasets. As a result, the open-vocabulary capacity of pre-trained VLMs is severely reduced after fine-tuning. However, without fine-tuning, VLMs trained under weak image-text supervision tend to make suboptimal mask predictions when the text queries refer to concepts absent from the image. To alleviate these issues, we introduce a novel recurrent framework that progressively filters out irrelevant texts and enhances mask quality without any training. The recurrent unit is a two-stage segmenter built upon a VLM with frozen weights. Thus, our model retains the VLM's broad vocabulary space while strengthening its segmentation capability. Experimental results show that our method outperforms not only its training-free counterparts but also methods fine-tuned with millions of additional data samples, and it sets new state-of-the-art records for both zero-shot semantic segmentation and referring image segmentation. Specifically, we improve the current records by 28.8, 16.0, and 6.9 mIoU on Pascal VOC, COCO Object, and Pascal Context, respectively.
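To make the recurrent idea in the abstract concrete, below is a minimal sketch of the filtering loop: a frozen segmenter is applied repeatedly, and text queries whose masks align poorly with the image are dropped until the query set stops changing. The callables `segment` and `score`, their signatures, and the threshold are illustrative assumptions, not the authors' actual implementation, which builds both stages of the segmenter on a frozen CLIP.

```python
from typing import Callable, List, Sequence, Tuple

import numpy as np


def clip_as_rnn_loop(
    image: np.ndarray,
    text_queries: Sequence[str],
    segment: Callable[[np.ndarray, List[str]], List[np.ndarray]],
    score: Callable[[np.ndarray, np.ndarray, str], float],
    threshold: float = 0.5,
    max_iters: int = 10,
) -> Tuple[List[str], List[np.ndarray]]:
    """Recurrently segment `image`, dropping text queries whose masks
    align poorly with the image, until the query set reaches a fixed point.

    `segment` stands in for the paper's two-stage segmenter built on a
    VLM with frozen weights; `score` stands in for a mask-text alignment
    measure. Both are hypothetical placeholders.
    """
    queries: List[str] = list(text_queries)
    masks: List[np.ndarray] = []
    for _ in range(max_iters):
        # One recurrent step: predict one mask per remaining query with
        # the frozen model (no weights are updated at any point).
        masks = segment(image, queries)
        # Keep only queries whose masks are well supported by the image.
        kept = [
            q for q, m in zip(queries, masks)
            if score(image, m, q) >= threshold
        ]
        if kept == queries:  # fixed point: nothing left to filter out
            break
        queries = kept  # recur with the reduced text vocabulary
    return queries, masks
```

Because the recurrent unit's weights stay frozen, each iteration only changes the set of surviving text queries and the masks computed from them, which is how the method can keep the VLM's full vocabulary while still improving mask quality.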