
FrozenSeg: Harmonizing Frozen Foundation Models for Open-Vocabulary Segmentation

September 5, 2024
Authors: Xi Chen, Haosen Yang, Sheng Jin, Xiatian Zhu, Hongxun Yao
cs.AI

Abstract

Open-vocabulary segmentation poses significant challenges, as it requires segmenting and recognizing objects across an open set of categories in unconstrained environments. Building on the success of powerful vision-language (ViL) foundation models such as CLIP, recent efforts have sought to harness their zero-shot capabilities to recognize unseen categories. Despite notable performance improvements, these models still struggle to generate precise mask proposals for unseen categories and scenarios, ultimately resulting in inferior segmentation performance. To address this challenge, we introduce FrozenSeg, a novel approach designed to integrate spatial knowledge from a localization foundation model (e.g., SAM) with semantic knowledge extracted from a ViL model (e.g., CLIP) in a synergistic framework. Taking the ViL model's visual encoder as the feature backbone, we inject space-aware features into the learnable queries and CLIP features within the transformer decoder. In addition, we devise a mask proposal ensemble strategy to further improve recall and mask quality. To fully exploit pre-trained knowledge while minimizing training overhead, we freeze both foundation models and focus optimization solely on the lightweight transformer decoder for mask proposal generation, the performance bottleneck. Extensive experiments demonstrate that FrozenSeg advances state-of-the-art results across various segmentation benchmarks when trained exclusively on COCO panoptic data and tested in a zero-shot manner. Code is available at https://github.com/chenxi52/FrozenSeg.
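
For readers who want a concrete picture of the design the abstract describes, below is a minimal PyTorch sketch of the overall recipe: two frozen encoders supplying semantic (CLIP-like) and spatial (SAM-like) tokens, and a lightweight trainable transformer decoder whose learnable queries receive injected spatial features before attending to the CLIP features. All class names, dimensions, the cross-attention fusion, and the toy encoders are illustrative assumptions, not the paper's actual implementation; see the linked repository for that.

```python
# A minimal, self-contained sketch of the idea in the abstract: frozen CLIP
# and SAM encoders provide semantic and spatial features; only a lightweight
# transformer decoder is trained to produce mask proposals. Names, dimensions,
# and the fusion mechanism here are illustrative assumptions.
import torch
import torch.nn as nn


class ToyEncoder(nn.Module):
    """Stand-in for a frozen foundation-model encoder (CLIP or SAM).

    Maps an image (B, 3, H, W) to a sequence of tokens (B, N, D).
    """
    def __init__(self, dim: int = 256, patch: int = 16):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.patchify(x).flatten(2).transpose(1, 2)


class SpatialInjection(nn.Module):
    """Injects space-aware SAM features into the learnable queries via
    cross-attention (a hypothetical stand-in for the paper's injection)."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, queries: torch.Tensor, sam_feats: torch.Tensor) -> torch.Tensor:
        attended, _ = self.cross_attn(queries, sam_feats, sam_feats)
        return self.norm(queries + attended)


class FrozenSegSketch(nn.Module):
    def __init__(self, clip_encoder: nn.Module, sam_encoder: nn.Module,
                 dim: int = 256, num_queries: int = 100, num_layers: int = 3):
        super().__init__()
        # Both foundation models stay frozen; only the decoder is optimized.
        self.clip_encoder = clip_encoder.eval().requires_grad_(False)
        self.sam_encoder = sam_encoder.eval().requires_grad_(False)
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.injections = nn.ModuleList(
            SpatialInjection(dim) for _ in range(num_layers))
        self.decoder_layers = nn.ModuleList(
            nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
            for _ in range(num_layers))
        self.mask_head = nn.Linear(dim, dim)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            clip_feats = self.clip_encoder(image)  # semantic tokens (B, N, D)
            sam_feats = self.sam_encoder(image)    # spatial tokens  (B, N, D)
        q = self.queries.unsqueeze(0).expand(image.size(0), -1, -1)
        for inject, layer in zip(self.injections, self.decoder_layers):
            q = inject(q, sam_feats)   # spatial-knowledge injection
            q = layer(q, clip_feats)   # queries attend to frozen CLIP features
        # Dot-product mask prediction over CLIP tokens, as in query-based
        # segmenters; logits are (B, num_queries, N).
        return torch.einsum("bqd,bnd->bqn", self.mask_head(q), clip_feats)


# Toy usage: real frozen CLIP/SAM backbones would replace the ToyEncoders.
model = FrozenSegSketch(ToyEncoder(), ToyEncoder())
mask_logits = model(torch.randn(2, 3, 224, 224))
print(mask_logits.shape)  # torch.Size([2, 100, 196])
```

Because the encoders are frozen, only the queries, injection blocks, decoder layers, and mask head receive gradients, which matches the abstract's point that the mask-proposal decoder is both the performance bottleneck and the only component worth training.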