CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor

December 12, 2023
Authors: Shuyang Sun, Runjia Li, Philip Torr, Xiuye Gu, Siyang Li
cs.AI

Abstract

Existing open-vocabulary image segmentation methods require a fine-tuning step on mask annotations and/or image-text datasets. Mask labels are labor-intensive to produce, which limits the number of categories in segmentation datasets. As a result, the open-vocabulary capacity of pre-trained VLMs is severely reduced after fine-tuning. However, without fine-tuning, VLMs trained under weak image-text supervision tend to make suboptimal mask predictions when the text queries refer to concepts absent from the image. To alleviate these issues, we introduce a novel recurrent framework that progressively filters out irrelevant texts and enhances mask quality without any training. The recurrent unit is a two-stage segmenter built upon a VLM with frozen weights. Thus, our model retains the VLM's broad vocabulary space while strengthening its segmentation capability. Experimental results show that our method outperforms not only its training-free counterparts but also methods fine-tuned with millions of additional data samples, and it sets new state-of-the-art records for both zero-shot semantic segmentation and referring image segmentation. Specifically, we improve the current records by 28.8, 16.0, and 6.9 mIoU on Pascal VOC, COCO Object, and Pascal Context, respectively.
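To make the recurrent idea in the abstract concrete, below is a minimal sketch of the filtering loop: a frozen segmenter is applied repeatedly, and text queries whose masks align poorly with the image are dropped until the query set stops changing. The callables `segment` and `score`, their signatures, and the threshold are illustrative assumptions, not the authors' actual implementation, which builds both stages of the segmenter on a frozen CLIP.

```python
from typing import Callable, List, Sequence, Tuple

import numpy as np


def clip_as_rnn_loop(
    image: np.ndarray,
    text_queries: Sequence[str],
    segment: Callable[[np.ndarray, List[str]], List[np.ndarray]],
    score: Callable[[np.ndarray, np.ndarray, str], float],
    threshold: float = 0.5,
    max_iters: int = 10,
) -> Tuple[List[str], List[np.ndarray]]:
    """Recurrently segment `image`, dropping text queries whose masks
    align poorly with the image, until the query set reaches a fixed point.

    `segment` stands in for the paper's two-stage segmenter built on a
    VLM with frozen weights; `score` stands in for a mask-text alignment
    measure. Both are hypothetical placeholders.
    """
    queries: List[str] = list(text_queries)
    masks: List[np.ndarray] = []
    for _ in range(max_iters):
        # One recurrent step: predict one mask per remaining query with
        # the frozen model (no weights are updated at any point).
        masks = segment(image, queries)
        # Keep only queries whose masks are well supported by the image.
        kept = [
            q for q, m in zip(queries, masks)
            if score(image, m, q) >= threshold
        ]
        if kept == queries:  # fixed point: nothing left to filter out
            break
        queries = kept  # recur with the reduced text vocabulary
    return queries, masks
```

Because the recurrent unit's weights stay frozen, each iteration only changes the set of surviving text queries and the masks computed from them, which is how the method can keep the VLM's full vocabulary while still improving mask quality.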