CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor
December 12, 2023
作者: Shuyang Sun, Runjia Li, Philip Torr, Xiuye Gu, Siyang Li
cs.AI
Abstract
Existing open-vocabulary image segmentation methods require a fine-tuning
step on mask annotations and/or image-text datasets. Mask labels are
labor-intensive, which limits the number of categories in segmentation
datasets. As a result, the open-vocabulary capacity of pre-trained VLMs is
severely reduced after fine-tuning. However, without fine-tuning, VLMs trained
under weak image-text supervision tend to make suboptimal mask predictions when
there are text queries referring to non-existing concepts in the image. To
alleviate these issues, we introduce a novel recurrent framework that
progressively filters out irrelevant texts and enhances mask quality without
training efforts. The recurrent unit is a two-stage segmenter built upon a VLM
with frozen weights. Thus, our model retains the VLM's broad vocabulary space
and strengthens its segmentation capability. Experimental results show that our
method outperforms not only the training-free counterparts, but also those
fine-tuned with millions of additional data samples, and sets new
state-of-the-art records for both zero-shot semantic and referring image
segmentation tasks. Specifically, we improve the current record by 28.8, 16.0,
and 6.9 mIoU on Pascal VOC, COCO Object, and Pascal Context.
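
To make the recurrence concrete, below is a minimal sketch of the filtering loop the abstract describes: a frozen, VLM-based two-stage segmenter scores each text query against the image, queries judged irrelevant are dropped, and the reduced query set is fed back in until it stabilizes. This is not the authors' implementation; `two_stage_segmenter`, the dummy masks and scores it returns, and the 0.5 relevance threshold are illustrative assumptions.

```python
# Hypothetical sketch of the recurrent query-filtering loop (not the paper's code).
import random


def two_stage_segmenter(image, queries):
    """Placeholder for the frozen VLM-based two-stage segmenter.
    Returns a dummy mask and a dummy relevance score per text query."""
    masks = {q: f"mask_for_{q}" for q in queries}    # stand-in masks
    scores = {q: random.random() for q in queries}   # stand-in relevance scores
    return masks, scores


def clip_as_rnn(image, text_queries, score_threshold=0.5, max_steps=10):
    """Recurrently segment `image` against `text_queries`, dropping queries the
    segmenter deems irrelevant, until the query set stops changing."""
    queries = list(text_queries)
    masks = {}
    for _ in range(max_steps):
        masks, scores = two_stage_segmenter(image, queries)
        kept = [q for q in queries if scores[q] >= score_threshold]
        if kept == queries:   # nothing was filtered out -> recurrence has converged
            break
        queries = kept        # feed the filtered query set back into the segmenter
    return {q: masks[q] for q in queries}


# Example usage: queries naming concepts absent from the image tend to be filtered out.
print(clip_as_rnn("street.jpg", ["person", "bicycle", "unicorn"]))
```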