SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding
October 23, 2023
Authors: Haoxiang Wang, Pavan Kumar Anasosalu Vasu, Fartash Faghri, Raviteja Vemulapalli, Mehrdad Farajtabar, Sachin Mehta, Mohammad Rastegari, Oncel Tuzel, Hadi Pouransari
cs.AI
Abstract
The landscape of publicly available vision foundation models (VFMs), such as
CLIP and Segment Anything Model (SAM), is expanding rapidly. VFMs are endowed
with distinct capabilities stemming from their pre-training objectives. For
instance, CLIP excels in semantic understanding, while SAM specializes in
spatial understanding for segmentation. In this work, we introduce a simple
recipe to efficiently merge VFMs into a unified model that assimilates their
expertise. Our proposed method integrates multi-task learning, continual
learning techniques, and teacher-student distillation. This strategy entails
significantly less computational cost compared to traditional multi-task
training from scratch. Additionally, it only demands a small fraction of the
pre-training datasets that were initially used to train individual models. By
applying our method to SAM and CLIP, we derive SAM-CLIP: a unified model that
amalgamates the strengths of SAM and CLIP into a single backbone, making it apt
for edge device applications. We show that SAM-CLIP learns richer visual
representations, equipped with both localization and semantic features,
suitable for a broad range of vision tasks. SAM-CLIP obtains improved
performance on several head probing tasks when compared with SAM and CLIP. We
further show that SAM-CLIP not only retains the foundational strengths of its
precursor models but also introduces synergistic functionalities, most notably
in zero-shot semantic segmentation, where SAM-CLIP establishes new
state-of-the-art results on 5 benchmarks. It outperforms previous models that
are specifically designed for this task by a large margin, including +6.8% and
+5.9% mean IoU improvement on Pascal-VOC and COCO-Stuff datasets, respectively.
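
The abstract does not give the training objective explicitly, but the recipe it describes (distilling CLIP's semantic knowledge into the SAM backbone while rehearsing SAM's original promptable-segmentation task, in a multi-task, continual-learning style) can be sketched roughly as below. This is a minimal illustrative sketch, not the authors' implementation: all names (`backbone`, `clip_head`, `sam_head`, `clip_teacher`, `sam_teacher`, the loss weights) and the specific loss forms are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def merge_training_step(backbone, clip_head, sam_head,
                        clip_teacher, sam_teacher,
                        clip_images, sam_images, sam_prompts,
                        w_clip=1.0, w_sam=1.0):
    """One hypothetical multi-task distillation step for merging
    CLIP-style semantic knowledge into a SAM-style backbone while
    rehearsing the original segmentation task (continual learning)."""
    # Distillation branch: match the student's image embedding to the
    # frozen CLIP teacher's embedding (cosine distance as a stand-in loss).
    with torch.no_grad():
        target_emb = clip_teacher(clip_images)
    student_emb = clip_head(backbone(clip_images))
    loss_clip = 1.0 - F.cosine_similarity(student_emb, target_emb, dim=-1).mean()

    # Rehearsal branch: keep promptable mask predictions close to the frozen
    # SAM teacher's outputs on a small replay subset of SAM's pre-training data.
    with torch.no_grad():
        target_masks = sam_teacher(sam_images, sam_prompts)
    student_masks = sam_head(backbone(sam_images), sam_prompts)
    loss_sam = F.binary_cross_entropy_with_logits(student_masks, target_masks.sigmoid())

    # The weights trade off absorbing new semantic knowledge against
    # forgetting the spatial/segmentation capability of the base model.
    return w_clip * loss_clip + w_sam * loss_sam
```

In this reading, the small replay fraction of each pre-training dataset feeds the corresponding branch, which is what keeps the merging cost far below multi-task training from scratch.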