

Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP

August 4, 2023
Authors: Qihang Yu, Ju He, Xueqing Deng, Xiaohui Shen, Liang-Chieh Chen
cs.AI

Abstract

Open-vocabulary segmentation is a challenging task requiring segmenting and recognizing objects from an open set of categories. One way to address this challenge is to leverage multi-modal models, such as CLIP, to provide image and text features in a shared embedding space, which bridges the gap between closed-vocabulary and open-vocabulary recognition. Hence, existing methods often adopt a two-stage framework to tackle the problem, where the inputs first go through a mask generator and then through the CLIP model along with the predicted masks. This process involves extracting features from images multiple times, which can be ineffective and inefficient. By contrast, we propose to build everything into a single-stage framework using a shared Frozen Convolutional CLIP backbone, which not only significantly simplifies the current two-stage pipeline, but also yields a remarkably better accuracy-cost trade-off. The proposed FC-CLIP benefits from the following observations: the frozen CLIP backbone maintains the ability of open-vocabulary classification and can also serve as a strong mask generator, and the convolutional CLIP generalizes well to a larger input resolution than the one used during contrastive image-text pretraining. When training on COCO panoptic data only and testing in a zero-shot manner, FC-CLIP achieves 26.8 PQ, 16.8 AP, and 34.1 mIoU on ADE20K, 18.2 PQ and 27.9 mIoU on Mapillary Vistas, and 44.0 PQ, 26.8 AP, and 56.2 mIoU on Cityscapes, outperforming the prior art by +4.2 PQ, +2.4 AP, and +4.2 mIoU on ADE20K, +4.0 PQ on Mapillary Vistas, and +20.1 PQ on Cityscapes, respectively. Additionally, the training and testing of FC-CLIP are 7.5x and 6.6x faster than the same prior art, respectively, while using 5.9x fewer parameters. FC-CLIP also sets a new state-of-the-art performance across various open-vocabulary semantic segmentation datasets. Code is available at https://github.com/bytedance/fc-clip
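
To make the single-stage design concrete, the sketch below illustrates the idea described in the abstract: a frozen convolutional CLIP backbone whose features are extracted once and then shared by a trainable mask decoder and by mask-pooled, text-matched classification. All module names (`FrozenConvCLIPBackbone`, `MaskDecoder`, `classify_masks`), layer choices, and shapes are illustrative assumptions, not the authors' implementation; the official code is at https://github.com/bytedance/fc-clip.

```python
# Hypothetical minimal sketch of a single-stage open-vocabulary segmenter with a
# shared frozen convolutional backbone, assuming placeholder modules throughout.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FrozenConvCLIPBackbone(nn.Module):
    """Stand-in for a convolutional CLIP image encoder, kept frozen."""

    def __init__(self, dim=256):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=4, stride=4),
            nn.GELU(),
            nn.Conv2d(dim, dim, kernel_size=3, padding=1),
        )
        for p in self.parameters():
            p.requires_grad = False  # frozen: preserves the image-text alignment

    def forward(self, images):
        return self.stem(images)  # (B, C, H/4, W/4) dense features


class MaskDecoder(nn.Module):
    """Lightweight trainable head predicting N class-agnostic mask logits."""

    def __init__(self, dim=256, num_queries=100):
        super().__init__()
        self.queries = nn.Embedding(num_queries, dim)
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, feats):
        pixel = self.proj(feats)                                  # (B, C, H, W)
        return torch.einsum("qc,bchw->bqhw",
                            self.queries.weight, pixel)           # (B, Q, H, W)


def classify_masks(mask_logits, clip_feats, text_embeds):
    """Mask-pool the frozen backbone features and match them to text embeddings."""
    attn = mask_logits.sigmoid()                                  # (B, Q, H, W)
    pooled = torch.einsum("bqhw,bchw->bqc", attn, clip_feats)
    pooled = pooled / attn.sum(dim=(-2, -1)).clamp(min=1e-6).unsqueeze(-1)
    pooled = F.normalize(pooled, dim=-1)
    text = F.normalize(text_embeds, dim=-1)                       # (K, C)
    return pooled @ text.t()                                      # (B, Q, K) scores


if __name__ == "__main__":
    backbone, decoder = FrozenConvCLIPBackbone(), MaskDecoder()
    images = torch.randn(2, 3, 256, 256)
    text_embeds = torch.randn(5, 256)    # placeholder for CLIP text features
    feats = backbone(images)             # features extracted once, then shared
    masks = decoder(feats)
    scores = classify_masks(masks, feats, text_embeds)
    print(masks.shape, scores.shape)     # (2, 100, 64, 64), (2, 100, 5)
```

The point of the sketch is the data flow: image features are computed a single time and reused for both mask prediction and open-vocabulary classification, in contrast to the two-stage pipelines described above, which re-encode the image (or masked crops) with CLIP after mask generation.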