Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP

Abstract

Open-vocabulary segmentation is a challenging task that requires segmenting and recognizing objects from an open set of categories. One way to address this challenge is to leverage multi-modal models, such as CLIP, which provide image and text features in a shared embedding space and thereby bridge the gap between closed-vocabulary and open-vocabulary recognition. Existing methods therefore often adopt a two-stage framework, in which the inputs first go through a mask generator and then through the CLIP model along with the predicted masks. This process extracts features from the image multiple times, which can be ineffective and inefficient. By contrast, we propose to build everything into a single-stage framework using a shared Frozen Convolutional CLIP backbone, which not only significantly simplifies the current two-stage pipeline but also yields a better accuracy-cost trade-off. The proposed FC-CLIP benefits from the following observations: the frozen CLIP backbone maintains the ability of open-vocabulary classification and can also serve as a strong mask generator, and the convolutional CLIP generalizes well to a larger input resolution than the one used during contrastive image-text pretraining. When trained on COCO panoptic data only and tested in a zero-shot manner, FC-CLIP achieves 26.8 PQ, 16.8 AP, and 34.1 mIoU on ADE20K; 18.2 PQ and 27.9 mIoU on Mapillary Vistas; and 44.0 PQ, 26.8 AP, and 56.2 mIoU on Cityscapes, outperforming the prior art by +4.2 PQ, +2.4 AP, and +4.2 mIoU on ADE20K, +4.0 PQ on Mapillary Vistas, and +20.1 PQ on Cityscapes. Additionally, training and testing of FC-CLIP are 7.5x and 6.6x faster, respectively, than the same prior art, while using 5.9x fewer parameters. FC-CLIP also sets a new state of the art across various open-vocabulary semantic segmentation datasets. Code at https://github.com/bytedance/fc-clip
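
To make the single-stage idea concrete, the following is a minimal sketch of how a predicted mask might be classified against text embeddings using features from one shared backbone pass. The abstract does not spell out this step, so the sketch assumes mask-average pooling followed by cosine similarity; the function names (`mask_pool`, `cosine`, `classify_mask`) and the toy feature map are hypothetical and are not taken from the FC-CLIP code base.

```python
import math


def mask_pool(features, mask):
    """Average the feature vectors at pixels where the binary mask is 1.

    Assumption: mask-average pooling stands in for whatever pooling the
    actual method uses; `features` is an H x W grid of C-dim vectors.
    """
    channels = len(features[0][0])
    total = [0.0] * channels
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for vec, inside in zip(feat_row, mask_row):
            if inside:
                for c in range(channels):
                    total[c] += vec[c]
                count += 1
    if count == 0:
        raise ValueError("mask selects no pixels")
    return [t / count for t in total]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify_mask(features, mask, text_embeddings):
    """Return the category whose text embedding best matches the pooled feature."""
    pooled = mask_pool(features, mask)
    return max(text_embeddings, key=lambda name: cosine(pooled, text_embeddings[name]))


# Toy data (hypothetical): a 2x2 feature map with 2 channels, computed once
# and reused for every mask, which is the point of a shared backbone.
features = [
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.0, 1.0], [0.1, 0.9]],
]
top_mask = [[1, 1], [0, 0]]
bottom_mask = [[0, 0], [1, 1]]
text_embeddings = {"cat": [1.0, 0.0], "grass": [0.0, 1.0]}

print(classify_mask(features, top_mask, text_embeddings))     # cat
print(classify_mask(features, bottom_mask, text_embeddings))  # grass
```

Because the category set enters only through `text_embeddings`, new categories can be added at test time without changing the features, which is what makes this kind of classification open-vocabulary.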
