
SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding

October 23, 2023
Authors: Haoxiang Wang, Pavan Kumar Anasosalu Vasu, Fartash Faghri, Raviteja Vemulapalli, Mehrdad Farajtabar, Sachin Mehta, Mohammad Rastegari, Oncel Tuzel, Hadi Pouransari
cs.AI

Abstract

The landscape of publicly available vision foundation models (VFMs), such as CLIP and Segment Anything Model (SAM), is expanding rapidly. VFMs are endowed with distinct capabilities stemming from their pre-training objectives. For instance, CLIP excels in semantic understanding, while SAM specializes in spatial understanding for segmentation. In this work, we introduce a simple recipe to efficiently merge VFMs into a unified model that assimilates their expertise. Our proposed method integrates multi-task learning, continual learning techniques, and teacher-student distillation. This strategy entails significantly less computational cost compared to traditional multi-task training from scratch. Additionally, it only demands a small fraction of the pre-training datasets that were initially used to train individual models. By applying our method to SAM and CLIP, we derive SAM-CLIP: a unified model that amalgamates the strengths of SAM and CLIP into a single backbone, making it apt for edge device applications. We show that SAM-CLIP learns richer visual representations, equipped with both localization and semantic features, suitable for a broad range of vision tasks. SAM-CLIP obtains improved performance on several head probing tasks when compared with SAM and CLIP. We further show that SAM-CLIP not only retains the foundational strengths of its precursor models but also introduces synergistic functionalities, most notably in zero-shot semantic segmentation, where SAM-CLIP establishes new state-of-the-art results on 5 benchmarks. It outperforms previous models that are specifically designed for this task by a large margin, including +6.8% and +5.9% mean IoU improvement on Pascal-VOC and COCO-Stuff datasets, respectively.
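The abstract only names the ingredients of the merging recipe (multi-task learning, continual learning techniques, and teacher-student distillation), so the following is a minimal PyTorch-style sketch of the general idea rather than the authors' implementation: a single student backbone with two lightweight heads is trained to match a frozen SAM-like teacher on spatial features and a frozen CLIP-like teacher on semantic features. All module names, dimensions, and loss weights (`TinyEncoder`, `w_sam`, `w_clip`, etc.) are illustrative assumptions.

```python
# Illustrative sketch of merging two vision teachers into one student backbone
# via multi-task distillation. Module names, dims, and weights are assumptions,
# not the exact SAM-CLIP recipe.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyEncoder(nn.Module):
    """Stand-in for a ViT backbone; emits patch features of width `dim`."""
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # patchify

    def forward(self, x):
        return self.proj(x).flatten(2).transpose(1, 2)  # (B, N_patches, dim)


class MergedStudent(nn.Module):
    """Shared backbone plus two lightweight heads, one per teacher's output space."""
    def __init__(self, dim=256, sam_dim=256, clip_dim=512):
        super().__init__()
        self.backbone = TinyEncoder(dim)
        self.sam_head = nn.Linear(dim, sam_dim)    # spatial / mask-feature head
        self.clip_head = nn.Linear(dim, clip_dim)  # semantic / CLIP-space head

    def forward(self, x):
        feats = self.backbone(x)
        return self.sam_head(feats), self.clip_head(feats)


def distill_step(student, sam_teacher, clip_teacher, images, w_sam=1.0, w_clip=1.0):
    """Multi-task distillation loss: keep the student close to both frozen teachers."""
    with torch.no_grad():
        sam_target = sam_teacher(images)                 # per-patch spatial features
        clip_target = clip_teacher(images).mean(dim=1)   # pooled semantic embedding
    sam_pred, clip_pred = student(images)
    loss_sam = F.mse_loss(sam_pred, sam_target)
    loss_clip = 1 - F.cosine_similarity(clip_pred.mean(dim=1), clip_target, dim=-1).mean()
    return w_sam * loss_sam + w_clip * loss_clip


if __name__ == "__main__":
    student = MergedStudent()
    sam_teacher, clip_teacher = TinyEncoder(256), TinyEncoder(512)  # frozen stand-ins
    for t in (sam_teacher, clip_teacher):
        t.requires_grad_(False)
    images = torch.randn(2, 3, 224, 224)
    loss = distill_step(student, sam_teacher, clip_teacher, images)
    loss.backward()
    print(float(loss))
```

Per the abstract, the actual method also relies on continual-learning techniques and only a small fraction of each teacher's original pre-training data to limit forgetting; those aspects are omitted from this sketch.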