Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP

Abstract

Open-vocabulary segmentation is a challenging task that requires segmenting and recognizing objects from an open set of categories. One way to address this challenge is to leverage multi-modal models, such as CLIP, which provide image and text features in a shared embedding space and thereby bridge the gap between closed-vocabulary and open-vocabulary recognition. Existing methods therefore often adopt a two-stage framework, in which the input first goes through a mask generator and then through the CLIP model along with the predicted masks. This process extracts features from the image multiple times, which can be ineffective and inefficient. By contrast, we propose to build everything into a single-stage framework using a shared Frozen Convolutional CLIP backbone, which not only significantly simplifies the current two-stage pipeline but also yields a markedly better accuracy-cost trade-off. The proposed FC-CLIP benefits from the following observations: the frozen CLIP backbone retains its ability for open-vocabulary classification and can also serve as a strong mask generator, and the convolutional CLIP generalizes well to input resolutions larger than the one used during contrastive image-text pretraining. When trained on COCO panoptic data only and tested in a zero-shot manner, FC-CLIP achieves 26.8 PQ, 16.8 AP, and 34.1 mIoU on ADE20K; 18.2 PQ and 27.9 mIoU on Mapillary Vistas; and 44.0 PQ, 26.8 AP, and 56.2 mIoU on Cityscapes. This outperforms the prior art by +4.2 PQ, +2.4 AP, and +4.2 mIoU on ADE20K, +4.0 PQ on Mapillary Vistas, and +20.1 PQ on Cityscapes. Additionally, training and testing of FC-CLIP are 7.5x and 6.6x faster, respectively, than the same prior art, while using 5.9x fewer parameters. FC-CLIP also sets a new state of the art across various open-vocabulary semantic segmentation datasets. Code is available at https://github.com/bytedance/fc-clip
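
As a rough illustration of the single-stage idea, the sketch below computes one shared feature map once and reuses it to label every predicted mask by comparing a pooled feature with text embeddings. It is a toy in plain Python, not the FC-CLIP implementation: the abstract does not say how mask features are aggregated, so the mask-average pooling, the cosine-similarity classifier, the function names (`mask_pool`, `cosine`, `classify_masks`), and all numeric values are assumptions made for illustration.

```python
import math


def mask_pool(features, mask):
    """Average the per-pixel feature vectors selected by a binary mask.

    features: H x W x C nested lists (stand-in for frozen backbone features).
    mask:     H x W nested lists of 0/1 (stand-in for one predicted mask).
    Pooling by mask average is an assumption of this sketch.
    """
    channels = len(features[0][0])
    total = [0.0] * channels
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for vec, keep in zip(feat_row, mask_row):
            if keep:
                count += 1
                for c in range(channels):
                    total[c] += vec[c]
    if count == 0:
        return total  # empty mask: return the zero vector
    return [t / count for t in total]


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def classify_masks(features, masks, text_embeddings):
    """Label each mask with the category whose text embedding is most similar."""
    labels = []
    for mask in masks:
        pooled = mask_pool(features, mask)
        best = max(text_embeddings,
                   key=lambda name: cosine(pooled, text_embeddings[name]))
        labels.append(best)
    return labels


# Toy 2x2 feature map with 2 channels, computed once and shared by all masks.
features = [
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.0, 1.0], [0.1, 0.9]],
]
masks = [
    [[1, 1], [0, 0]],  # mask covering the top row
    [[0, 0], [1, 1]],  # mask covering the bottom row
]
# Hypothetical text embeddings for two category names.
text_embeddings = {"cat": [1.0, 0.0], "grass": [0.0, 1.0]}

print(classify_masks(features, masks, text_embeddings))  # ['cat', 'grass']
```

The point of the sketch is that `features` is computed once and shared; in a two-stage pipeline of the kind the abstract describes, the image would instead pass through a mask generator and then again through CLIP together with the predicted masks.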
