
Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP

August 4, 2023
Authors: Qihang Yu, Ju He, Xueqing Deng, Xiaohui Shen, Liang-Chieh Chen
cs.AI

Abstract

Open-vocabulary segmentation is a challenging task that requires segmenting and recognizing objects from an open set of categories. One way to address this challenge is to leverage multi-modal models, such as CLIP, which provide image and text features in a shared embedding space and thus bridge the gap between closed-vocabulary and open-vocabulary recognition. Hence, existing methods often adopt a two-stage framework, where the inputs first go through a mask generator and then through the CLIP model along with the predicted masks. This process extracts features from the image multiple times, which can be both ineffective and inefficient. By contrast, we propose to build everything into a single-stage framework using a shared Frozen Convolutional CLIP backbone, which not only significantly simplifies the current two-stage pipeline but also yields a remarkably better accuracy-cost trade-off. The proposed FC-CLIP benefits from the following observations: the frozen CLIP backbone retains its open-vocabulary classification ability and can also serve as a strong mask generator, and the convolutional CLIP generalizes well to input resolutions larger than the one used during contrastive image-text pretraining. When trained on COCO panoptic data only and tested in a zero-shot manner, FC-CLIP achieves 26.8 PQ, 16.8 AP, and 34.1 mIoU on ADE20K, 18.2 PQ and 27.9 mIoU on Mapillary Vistas, and 44.0 PQ, 26.8 AP, and 56.2 mIoU on Cityscapes, outperforming the prior art by +4.2 PQ, +2.4 AP, and +4.2 mIoU on ADE20K, +4.0 PQ on Mapillary Vistas, and +20.1 PQ on Cityscapes, respectively. Additionally, FC-CLIP trains 7.5x faster and tests 6.6x faster than the same prior art, while using 5.9x fewer parameters. FC-CLIP also sets new state-of-the-art performance across various open-vocabulary semantic segmentation datasets. Code is available at https://github.com/bytedance/fc-clip
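The core idea described in the abstract, a single frozen convolutional CLIP backbone shared by the mask generator and the open-vocabulary classifier, can be sketched roughly as below. This is a minimal illustration, not the authors' implementation: the backbone interface, the query-based mask head, and all names (FrozenConvCLIPSeg, clip_visual, clip_text_embeds) are hypothetical stand-ins, and the dense features are assumed to already live in the joint CLIP embedding space.

```python
# Hypothetical sketch of FC-CLIP's single-stage design: one frozen
# convolutional CLIP image encoder serves both mask generation and
# open-vocabulary classification. All module names are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FrozenConvCLIPSeg(nn.Module):
    def __init__(self, clip_visual, clip_text_embeds, num_queries=100, dim=256):
        super().__init__()
        # Frozen convolutional CLIP image encoder (e.g. a ConvNeXt-based CLIP),
        # assumed to return a dense feature map of shape (B, C, H, W).
        self.backbone = clip_visual.eval()
        for p in self.backbone.parameters():
            p.requires_grad_(False)
        # Precomputed CLIP text embeddings for the test-time vocabulary: (K, C).
        self.register_buffer("text_embeds", F.normalize(clip_text_embeds, dim=-1))
        # Trainable mask generator on top of the frozen features
        # (a stand-in for a full pixel decoder + transformer mask decoder).
        self.query_embed = nn.Embedding(num_queries, dim)
        self.pixel_proj = nn.Conv2d(clip_text_embeds.shape[1], dim, kernel_size=1)

    def forward(self, image):
        # A single pass through the shared frozen backbone, possibly at a
        # larger resolution than CLIP pretraining used.
        feats = self.backbone(image)                        # (B, C, H, W)
        proj = self.pixel_proj(feats)                       # (B, D, H, W)
        # Class-agnostic masks from query-pixel similarity (highly simplified).
        masks = torch.einsum("qd,bdhw->bqhw", self.query_embed.weight, proj)
        # Open-vocabulary classification: mask-pool the *frozen* CLIP features
        # and compare them with the text embeddings in CLIP space.
        attn = masks.sigmoid()                              # (B, Q, H, W)
        pooled = torch.einsum("bqhw,bchw->bqc", attn, feats) / (
            attn.sum(dim=(-2, -1)).unsqueeze(-1) + 1e-6
        )
        logits = F.normalize(pooled, dim=-1) @ self.text_embeds.t()  # (B, Q, K)
        return masks, logits
```

The sketch only conveys the shared-frozen-backbone idea: because the same features are reused for mask prediction and for CLIP-space classification, the image is encoded once, which is where the reported speed and parameter savings over two-stage pipelines come from. The actual mask generator and the way in-vocabulary and out-of-vocabulary predictions are combined are considerably richer in the paper.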