
Open-Vocabulary SAM: Segment and Recognize Twenty-thousand Classes Interactively

January 5, 2024
Authors: Haobo Yuan, Xiangtai Li, Chong Zhou, Yining Li, Kai Chen, Chen Change Loy
cs.AI

Abstract

CLIP and the Segment Anything Model (SAM) are remarkable vision foundation models (VFMs). SAM excels at segmentation tasks across diverse domains, while CLIP is renowned for its zero-shot recognition capabilities. This paper presents an in-depth exploration of integrating these two models into a unified framework. Specifically, we introduce Open-Vocabulary SAM, a SAM-inspired model designed for simultaneous interactive segmentation and recognition, leveraging two unique knowledge transfer modules: SAM2CLIP and CLIP2SAM. The former adapts SAM's knowledge into CLIP via distillation and learnable transformer adapters, while the latter transfers CLIP's knowledge into SAM, enhancing its recognition capabilities. Extensive experiments on various datasets and detectors show the effectiveness of Open-Vocabulary SAM in both segmentation and recognition tasks, significantly outperforming the naive baselines of simply combining SAM and CLIP. Furthermore, aided by training on image classification data, our method can segment and recognize approximately 22,000 classes.
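
As a rough illustration of the design described in the abstract, the minimal PyTorch sketch below shows how a SAM2CLIP-style adapter (aligning SAM encoder features with the CLIP feature space for distillation) and a CLIP2SAM-style recognition head (scoring mask queries against CLIP text embeddings) could be wired together. All module names, feature dimensions, and the fusion scheme here are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch only: hypothetical modules and shapes, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SAM2CLIPAdapter(nn.Module):
    """Hypothetical adapter: projects frozen SAM encoder tokens into the CLIP
    feature space with a small transformer, so a distillation loss can transfer
    SAM's segmentation knowledge into the CLIP branch."""

    def __init__(self, sam_dim=256, clip_dim=1024, num_layers=2, num_heads=8):
        super().__init__()
        self.proj = nn.Linear(sam_dim, clip_dim)
        layer = nn.TransformerEncoderLayer(d_model=clip_dim, nhead=num_heads,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, sam_tokens):                    # (B, N, sam_dim)
        return self.transformer(self.proj(sam_tokens))  # (B, N, clip_dim)


class CLIP2SAMHead(nn.Module):
    """Hypothetical recognition head: fuses a CLIP visual feature into the SAM
    mask query and scores the result against CLIP text embeddings, assigning
    each predicted mask an open-vocabulary label."""

    def __init__(self, clip_dim=1024, sam_dim=256):
        super().__init__()
        self.fuse = nn.Linear(clip_dim + sam_dim, clip_dim)

    def forward(self, mask_query, clip_feat, text_embeds):
        # mask_query: (B, sam_dim), clip_feat: (B, clip_dim)
        # text_embeds: (C, clip_dim), one row per class name
        fused = F.normalize(self.fuse(torch.cat([mask_query, clip_feat], dim=-1)), dim=-1)
        text = F.normalize(text_embeds, dim=-1)
        return fused @ text.t()                       # (B, C) class logits


# Toy usage with random tensors standing in for real SAM/CLIP features.
adapter = SAM2CLIPAdapter()
head = CLIP2SAMHead()
sam_tokens = torch.randn(2, 196, 256)
clip_like = adapter(sam_tokens)                       # distillation target side
logits = head(torch.randn(2, 256), clip_like.mean(dim=1), torch.randn(20, 1024))
print(logits.shape)  # torch.Size([2, 20])
```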