Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP

Abstract

Open-vocabulary segmentation is a challenging task that requires segmenting and recognizing objects from an open set of categories. One way to address this challenge is to leverage multi-modal models, such as CLIP, to provide image and text features in a shared embedding space, which bridges the gap between closed-vocabulary and open-vocabulary recognition. Hence, existing methods often adopt a two-stage framework, where the inputs first go through a mask generator and then through the CLIP model along with the predicted masks. This process extracts features from each image multiple times, which can be ineffective and inefficient. By contrast, we propose to build everything into a single-stage framework using a shared Frozen Convolutional CLIP backbone, which not only significantly simplifies the current two-stage pipeline but also yields a better accuracy-cost trade-off. The proposed FC-CLIP benefits from the following observations: the frozen CLIP backbone maintains the ability of open-vocabulary classification and can also serve as a strong mask generator, and the convolutional CLIP generalizes well to a larger input resolution than the one used during contrastive image-text pretraining. When trained on COCO panoptic data only and tested in a zero-shot manner, FC-CLIP achieves 26.8 PQ, 16.8 AP, and 34.1 mIoU on ADE20K; 18.2 PQ and 27.9 mIoU on Mapillary Vistas; and 44.0 PQ, 26.8 AP, and 56.2 mIoU on Cityscapes, outperforming the prior art by +4.2 PQ, +2.4 AP, and +4.2 mIoU on ADE20K, +4.0 PQ on Mapillary Vistas, and +20.1 PQ on Cityscapes. Additionally, training and testing with FC-CLIP are 7.5x and 6.6x faster, respectively, than with the same prior art, while using 5.9x fewer parameters. FC-CLIP also sets a new state of the art across various open-vocabulary semantic segmentation datasets. Code is available at https://github.com/bytedance/fc-clip
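To make the single-stage idea concrete, the sketch below shows one plausible way to classify predicted masks from a feature map that is computed only once: average the backbone features inside each mask, then compare the pooled vector with text embeddings by cosine similarity. This is an illustration under stated assumptions, not the authors' implementation. The abstract does not specify the pooling or scoring scheme, and the function names (`mask_pool`, `classify_masks`), the temperature value, and the toy two-channel features are all hypothetical.

```python
import math

def mask_pool(features, mask):
    """Average the feature vectors at pixels where the mask is nonzero.

    features: H x W x C nested lists; mask: H x W nested lists of weights (0/1 here).
    """
    channels = len(features[0][0])
    pooled = [0.0] * channels
    total = 0.0
    for feat_row, mask_row in zip(features, mask):
        for vec, weight in zip(feat_row, mask_row):
            if weight:
                total += weight
                for c in range(channels):
                    pooled[c] += weight * vec[c]
    if total == 0:
        return pooled  # empty mask: return the zero vector
    return [v / total for v in pooled]

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def classify_masks(features, masks, text_embeddings, temperature=0.01):
    """Return, for each mask, a dict of class probabilities.

    The temperature is an assumed value chosen for the toy example.
    """
    results = []
    for mask in masks:
        embedding = mask_pool(features, mask)
        logits = {name: cosine(embedding, text) / temperature
                  for name, text in text_embeddings.items()}
        top = max(logits.values())  # subtract the max for a stable softmax
        exps = {name: math.exp(value - top) for name, value in logits.items()}
        norm = sum(exps.values())
        results.append({name: e / norm for name, e in exps.items()})
    return results

# Toy stand-ins: a 2 x 2 feature map with 2 channels, two masks, and two
# hypothetical class-name embeddings living in the same 2-D space.
features = [
    [[1.0, 0.0], [1.0, 0.1]],
    [[0.0, 1.0], [0.1, 1.0]],
]
masks = [
    [[1, 1], [0, 0]],  # top row
    [[0, 0], [1, 1]],  # bottom row
]
text_embeddings = {"cat": [1.0, 0.0], "road": [0.0, 1.0]}

probs = classify_masks(features, masks, text_embeddings)
labels = [max(p, key=p.get) for p in probs]
print(labels)  # ['cat', 'road']
```

Because the class names enter only through `text_embeddings`, the vocabulary can be changed at test time without touching the features, which is what makes this style of classification open-vocabulary. The feature map is also reused for every mask, in contrast to a two-stage pipeline that re-encodes the image for each predicted mask.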
