Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP

Abstract

Open-vocabulary segmentation is a challenging task that requires segmenting and recognizing objects from an open set of categories. One way to address this challenge is to leverage multi-modal models, such as CLIP, to provide image and text features in a shared embedding space, which bridges the gap between closed-vocabulary and open-vocabulary recognition. Hence, existing methods often adopt a two-stage framework to tackle the problem, where the inputs first go through a mask generator and then through the CLIP model along with the predicted masks. This process involves extracting features from images multiple times, which can be ineffective and inefficient. By contrast, we propose to build everything into a single-stage framework using a shared Frozen Convolutional CLIP backbone, which not only significantly simplifies the current two-stage pipeline, but also yields a better accuracy-cost trade-off. The proposed FC-CLIP benefits from the following observations: the frozen CLIP backbone maintains the ability of open-vocabulary classification and can also serve as a strong mask generator, and the convolutional CLIP generalizes well to a larger input resolution than the one used during contrastive image-text pretraining. When trained on COCO panoptic data only and tested in a zero-shot manner, FC-CLIP achieves 26.8 PQ, 16.8 AP, and 34.1 mIoU on ADE20K, 18.2 PQ and 27.9 mIoU on Mapillary Vistas, and 44.0 PQ, 26.8 AP, and 56.2 mIoU on Cityscapes, outperforming the prior art by +4.2 PQ, +2.4 AP, and +4.2 mIoU on ADE20K, +4.0 PQ on Mapillary Vistas, and +20.1 PQ on Cityscapes, respectively. Additionally, FC-CLIP trains 7.5x faster and tests 6.6x faster than the same prior art, while using 5.9x fewer parameters. FC-CLIP also sets a new state-of-the-art performance across various open-vocabulary semantic segmentation datasets. Code is available at https://github.com/bytedance/fc-clip
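To make the single-stage idea concrete, the following is a minimal sketch of one plausible way to reuse a single backbone feature map for both masks and open-vocabulary labels: features are extracted once, averaged inside each predicted mask, and matched to text embeddings by cosine similarity. The abstract does not specify this mechanism, so the mask-pooling step, the function names (`mask_pool`, `cosine`, `classify_masks`), and the toy two-channel features are all illustrative assumptions rather than the FC-CLIP implementation.

```python
import math


def mask_pool(features, mask):
    """Average the feature vectors at pixels where the mask is 1.

    features: H x W x C nested lists (assumed output of one backbone pass).
    mask:     H x W nested lists of 0/1 (assumed output of a mask generator).
    """
    channels = len(features[0][0])
    total = [0.0] * channels
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for vec, inside in zip(feat_row, mask_row):
            if inside:
                count += 1
                for c in range(channels):
                    total[c] += vec[c]
    if count == 0:
        return total  # empty mask: return the zero vector
    return [t / count for t in total]


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify_masks(features, masks, text_embeddings):
    """Assign each mask the category whose text embedding is most similar
    to the mask-pooled image feature. The feature map is computed once and
    shared across all masks, which is the point of a single-stage design."""
    labels = []
    for mask in masks:
        pooled = mask_pool(features, mask)
        best = max(text_embeddings,
                   key=lambda name: cosine(pooled, text_embeddings[name]))
        labels.append(best)
    return labels


# Toy data (hypothetical): a 2 x 2 feature map with 2 channels.
features = [
    [[1.0, 0.0], [1.0, 0.1]],
    [[0.0, 1.0], [0.1, 1.0]],
]
masks = [
    [[1, 1], [0, 0]],  # top row
    [[0, 0], [1, 1]],  # bottom row
]
# Stand-ins for text-encoder outputs of arbitrary category names.
text_embeddings = {"cat": [1.0, 0.0], "grass": [0.0, 1.0]}

labels = classify_masks(features, masks, text_embeddings)
print(labels)  # ['cat', 'grass']
```

Because the category names enter only through `text_embeddings`, the vocabulary can be changed at test time without touching the image features, which is what makes the classification open-vocabulary in this sketch.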
