FrozenSeg: Harmonizing Frozen Foundation Models for Open-Vocabulary Segmentation

Abstract

Open-vocabulary segmentation poses significant challenges, as it requires segmenting and recognizing objects across an open set of categories in unconstrained environments. Building on the success of powerful vision-language (ViL) foundation models such as CLIP, recent efforts have sought to harness their zero-shot capabilities to recognize unseen categories. Despite notable performance improvements, these models still struggle to generate precise mask proposals for unseen categories and scenarios, which ultimately results in inferior segmentation performance. To address this challenge, we introduce a novel approach, FrozenSeg, designed to integrate spatial knowledge from a localization foundation model (e.g., SAM) and semantic knowledge extracted from a ViL model (e.g., CLIP) in a synergistic framework. Taking the ViL model's visual encoder as the feature backbone, we inject the space-aware features into the learnable queries and CLIP features within the transformer decoder. In addition, we devise a mask proposal ensemble strategy to further improve the recall rate and mask quality. To fully exploit pre-trained knowledge while minimizing training overhead, we freeze both foundation models and focus optimization solely on a lightweight transformer decoder for mask proposal generation, which is the performance bottleneck. Extensive experiments demonstrate that FrozenSeg, trained exclusively on COCO panoptic data and tested in a zero-shot manner, advances state-of-the-art results across various segmentation benchmarks. Code is available at https://github.com/chenxi52/FrozenSeg.
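As a rough illustration of what "injecting" spatial features into learnable queries can look like, the sketch below applies a single-head residual cross-attention step in plain Python. It is not the authors' implementation: the function name `inject_spatial_features` is hypothetical, the query/key/value projections are assumed to be identity maps, and normalization layers, multiple heads, and the CLIP-feature branch are omitted. A real system would use a tensor library and learned projections.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def inject_spatial_features(queries, spatial_feats):
    """Toy residual cross-attention (hypothetical helper, identity projections).

    queries:       list of query vectors (stand-ins for learnable queries)
    spatial_feats: list of feature vectors (stand-ins for frozen SAM-like tokens)
    Returns the queries updated with an attention-weighted mix of spatial features.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    updated = []
    for q in queries:
        # Scaled dot-product score between this query and every spatial token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in spatial_feats]
        weights = softmax(scores)
        # Attention-weighted average of the spatial tokens.
        context = [
            sum(w * feat[d] for w, feat in zip(weights, spatial_feats))
            for d in range(dim)
        ]
        # Residual connection: the query keeps its content and gains spatial context.
        updated.append([qi + ci for qi, ci in zip(q, context)])
    return updated


if __name__ == "__main__":
    toy_queries = [[1.0, 0.0], [0.0, 1.0]]
    toy_spatial = [[1.0, 0.0], [1.0, 0.0]]
    print(inject_spatial_features(toy_queries, toy_spatial))
```

In this toy setting only the queries change; the spatial features are read but never modified, which loosely mirrors the idea of keeping the foundation models frozen while a lightweight decoder is trained.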

