CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor
December 12, 2023
作者: Shuyang Sun, Runjia Li, Philip Torr, Xiuye Gu, Siyang Li
cs.AI
Abstract
Existing open-vocabulary image segmentation methods require a fine-tuning
step on mask annotations and/or image-text datasets. Mask labels are
labor-intensive, which limits the number of categories in segmentation
datasets. As a result, the open-vocabulary capacity of pre-trained VLMs is
severely reduced after fine-tuning. However, without fine-tuning, VLMs trained
under weak image-text supervision tend to make suboptimal mask predictions when
there are text queries referring to non-existing concepts in the image. To
alleviate these issues, we introduce a novel recurrent framework that
progressively filters out irrelevant texts and enhances mask quality without
training efforts. The recurrent unit is a two-stage segmenter built upon a VLM
with frozen weights. Thus, our model retains the VLM's broad vocabulary space
and strengthens its segmentation capability. Experimental results show that our
method outperforms not only the training-free counterparts, but also those
fine-tuned with millions of additional data samples, and sets new
state-of-the-art records for both zero-shot semantic and referring image
segmentation tasks. Specifically, we improve the current record by 28.8, 16.0,
and 6.9 mIoU on Pascal VOC, COCO Object, and Pascal Context.
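
To make the recurrence concrete, below is a minimal sketch of the filtering loop the abstract describes: a frozen, VLM-based two-stage segmenter scores each text query against the image, queries judged irrelevant are dropped, and the reduced query set is fed back in until it stabilizes. This is not the authors' implementation; `two_stage_segmenter`, the dummy masks and scores it returns, and the 0.5 relevance threshold are illustrative assumptions.

```python
# Hypothetical sketch of the recurrent query-filtering loop (not the paper's code).
import random


def two_stage_segmenter(image, queries):
    """Placeholder for the frozen VLM-based two-stage segmenter.
    Returns a dummy mask and a dummy relevance score per text query."""
    masks = {q: f"mask_for_{q}" for q in queries}    # stand-in masks
    scores = {q: random.random() for q in queries}   # stand-in relevance scores
    return masks, scores


def clip_as_rnn(image, text_queries, score_threshold=0.5, max_steps=10):
    """Recurrently segment `image` against `text_queries`, dropping queries the
    segmenter deems irrelevant, until the query set stops changing."""
    queries = list(text_queries)
    masks = {}
    for _ in range(max_steps):
        masks, scores = two_stage_segmenter(image, queries)
        kept = [q for q in queries if scores[q] >= score_threshold]
        if kept == queries:   # nothing was filtered out -> recurrence has converged
            break
        queries = kept        # feed the filtered query set back into the segmenter
    return {q: masks[q] for q in queries}


# Example usage: queries naming concepts absent from the image tend to be filtered out.
print(clip_as_rnn("street.jpg", ["person", "bicycle", "unicorn"]))
```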