Open-Vocabulary SAM: Segment and Recognize Twenty-thousand Classes Interactively

January 5, 2024
Authors: Haobo Yuan, Xiangtai Li, Chong Zhou, Yining Li, Kai Chen, Chen Change Loy
cs.AI

Abstract

The CLIP and Segment Anything Model (SAM) are remarkable vision foundation models (VFMs). SAM excels in segmentation tasks across diverse domains, while CLIP is renowned for its zero-shot recognition capabilities. This paper presents an in-depth exploration of integrating these two models into a unified framework. Specifically, we introduce the Open-Vocabulary SAM, a SAM-inspired model designed for simultaneous interactive segmentation and recognition, leveraging two unique knowledge transfer modules: SAM2CLIP and CLIP2SAM. The former adapts SAM's knowledge into the CLIP via distillation and learnable transformer adapters, while the latter transfers CLIP knowledge into SAM, enhancing its recognition capabilities. Extensive experiments on various datasets and detectors show the effectiveness of Open-Vocabulary SAM in both segmentation and recognition tasks, significantly outperforming the naive baselines of simply combining SAM and CLIP. Furthermore, aided with image classification data training, our method can segment and recognize approximately 22,000 classes.
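The abstract names two knowledge-transfer modules, SAM2CLIP and CLIP2SAM, but gives no implementation detail. The sketch below is a minimal, hypothetical PyTorch illustration of how such adapters could be wired, not the authors' architecture: a small transformer adapter maps frozen CLIP visual tokens toward SAM's feature space under a distillation loss (the SAM2CLIP direction), and a light head pools mask-level region features and scores them against CLIP text embeddings for open-vocabulary recognition (the CLIP2SAM direction). All module names, feature dimensions, and the fusion scheme are illustrative assumptions.

```python
# Conceptual sketch only: hypothetical adapter modules inspired by the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SAM2CLIPAdapter(nn.Module):
    """Aligns frozen CLIP visual tokens with SAM's encoder feature space
    (trained by distilling from the frozen SAM image encoder)."""
    def __init__(self, clip_dim=768, sam_dim=256, num_layers=2):
        super().__init__()
        self.proj = nn.Linear(clip_dim, sam_dim)
        layer = nn.TransformerEncoderLayer(d_model=sam_dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, clip_tokens):                       # (B, N, clip_dim)
        return self.transformer(self.proj(clip_tokens))   # (B, N, sam_dim)

class CLIP2SAMHead(nn.Module):
    """Adds recognition on the mask-decoder side: pools the region indicated by a
    mask and scores it against frozen CLIP text embeddings of class names."""
    def __init__(self, sam_dim=256, clip_dim=768):
        super().__init__()
        self.to_clip = nn.Linear(sam_dim, clip_dim)

    def forward(self, dense_feats, mask, text_embeds):
        # dense_feats: (B, N, sam_dim); mask: (B, N) soft mask over tokens;
        # text_embeds: (C, clip_dim) CLIP text embeddings of the vocabulary.
        weights = mask.softmax(dim=-1).unsqueeze(-1)        # (B, N, 1)
        region = (dense_feats * weights).sum(dim=1)         # (B, sam_dim)
        region = F.normalize(self.to_clip(region), dim=-1)  # (B, clip_dim)
        text = F.normalize(text_embeds, dim=-1)
        return region @ text.t()                            # (B, C) class logits

def distillation_loss(adapted_feats, sam_teacher_feats):
    """SAM2CLIP training signal: match adapted features to frozen SAM encoder features."""
    return F.mse_loss(adapted_feats, sam_teacher_feats)

if __name__ == "__main__":
    B, N, C = 2, 196, 20000                  # batch, tokens, vocabulary size
    clip_tokens = torch.randn(B, N, 768)     # stand-in for frozen CLIP visual tokens
    sam_feats = torch.randn(B, N, 256)       # stand-in for frozen SAM encoder features
    text_embeds = torch.randn(C, 768)        # stand-in for CLIP text embeddings

    adapter, head = SAM2CLIPAdapter(), CLIP2SAMHead()
    adapted = adapter(clip_tokens)
    print("distill loss:", distillation_loss(adapted, sam_feats).item())
    mask = torch.rand(B, N)                  # stand-in for a predicted mask over tokens
    print("logits shape:", head(adapted, mask, text_embeds).shape)  # (2, 20000)
```

In this reading, sharing a single (CLIP-based) backbone for both segmentation and recognition is what avoids running two separate heavy encoders, which is the cost of the naive SAM-plus-CLIP combination the abstract compares against.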