
FrozenSeg: Harmonizing Frozen Foundation Models for Open-Vocabulary Segmentation

September 5, 2024
Authors: Xi Chen, Haosen Yang, Sheng Jin, Xiatian Zhu, Hongxun Yao
cs.AI

Abstract

Open-vocabulary segmentation poses significant challenges, as it requires segmenting and recognizing objects across an open set of categories in unconstrained environments. Building on the success of powerful vision-language (ViL) foundation models, such as CLIP, recent efforts have sought to harness their zero-shot capabilities to recognize unseen categories. Despite notable performance improvements, these models still encounter the critical issue of generating precise mask proposals for unseen categories and scenarios, ultimately resulting in inferior segmentation performance. To address this challenge, we introduce a novel approach, FrozenSeg, designed to integrate spatial knowledge from a localization foundation model (e.g., SAM) and semantic knowledge extracted from a ViL model (e.g., CLIP) in a synergistic framework. Taking the ViL model's visual encoder as the feature backbone, we inject space-aware features into the learnable queries and CLIP features within the transformer decoder. In addition, we devise a mask proposal ensemble strategy to further improve the recall rate and mask quality. To fully exploit pre-trained knowledge while minimizing training overhead, we freeze both foundation models, focusing optimization efforts solely on a lightweight transformer decoder for mask proposal generation, the performance bottleneck. Extensive experiments demonstrate that FrozenSeg advances state-of-the-art results across various segmentation benchmarks, trained exclusively on COCO panoptic data and tested in a zero-shot manner. Code is available at https://github.com/chenxi52/FrozenSeg.
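The core design described in the abstract, frozen encoders with a small trainable decoder that injects spatial features into learnable queries, can be sketched as follows. This is a minimal PyTorch illustration, not the authors' implementation: the `FrozenEncoder` stand-ins replace the real CLIP and SAM encoders, and the dimensions, module names, and attention layout are assumptions made for a self-contained example.

```python
import torch
import torch.nn as nn

class FrozenEncoder(nn.Module):
    """Stand-in for a frozen foundation-model encoder (CLIP or SAM in FrozenSeg)."""
    def __init__(self, out_dim=256):
        super().__init__()
        self.proj = nn.Conv2d(3, out_dim, kernel_size=16, stride=16)
        for p in self.parameters():
            p.requires_grad = False  # frozen: excluded from optimization

    def forward(self, x):
        f = self.proj(x)                     # (B, C, H/16, W/16)
        return f.flatten(2).transpose(1, 2)  # (B, N, C) token sequence

class LightweightMaskDecoder(nn.Module):
    """Trainable decoder: learnable queries receive space-aware (SAM-like)
    features via cross-attention, then attend to ViL (CLIP-like) features
    to produce mask embeddings."""
    def __init__(self, dim=256, num_queries=100, heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.inject_spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mask_head = nn.Linear(dim, dim)

    def forward(self, vil_feats, spatial_feats):
        B = vil_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        # inject space-aware features into the learnable queries
        q = q + self.inject_spatial(q, spatial_feats, spatial_feats)[0]
        # queries attend to the ViL backbone features
        q = q + self.cross_attn(q, vil_feats, vil_feats)[0]
        mask_emb = self.mask_head(q)  # (B, Q, C)
        # dot-product with pixel tokens -> per-query mask logits
        return torch.einsum("bqc,bnc->bqn", mask_emb, vil_feats)

vil_enc, sam_enc = FrozenEncoder(), FrozenEncoder()
decoder = LightweightMaskDecoder()
x = torch.randn(2, 3, 64, 64)
masks = decoder(vil_enc(x), sam_enc(x))  # (2, 100, 16) mask logits
```

Only `decoder` has trainable parameters, which reflects the paper's claim that optimization focuses solely on the lightweight mask-proposal decoder while both foundation models stay frozen.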