SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding

Abstract

The landscape of publicly available vision foundation models (VFMs), such as CLIP and the Segment Anything Model (SAM), is expanding rapidly. VFMs are endowed with distinct capabilities stemming from their pre-training objectives. For instance, CLIP excels in semantic understanding, while SAM specializes in spatial understanding for segmentation. In this work, we introduce a simple recipe to efficiently merge VFMs into a unified model that absorbs their expertise. Our method integrates multi-task learning, continual learning techniques, and teacher-student distillation. This strategy incurs significantly less computational cost than traditional multi-task training from scratch, and it requires only a small fraction of the pre-training datasets originally used to train the individual models. By applying our method to SAM and CLIP, we derive SAM-CLIP: a unified model that combines the strengths of SAM and CLIP in a single backbone, making it well suited to edge device applications. We show that SAM-CLIP learns richer visual representations, equipped with both localization and semantic features, suitable for a broad range of vision tasks. SAM-CLIP obtains improved performance on several head probing tasks when compared with SAM and CLIP. We further show that SAM-CLIP not only retains the foundational strengths of its precursor models but also introduces synergistic functionalities, most notably in zero-shot semantic segmentation, where SAM-CLIP establishes new state-of-the-art results on 5 benchmarks. It outperforms previous models specifically designed for this task by a large margin, including +6.8% and +5.9% mean IoU improvements on the Pascal-VOC and COCO-Stuff datasets, respectively.
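The segmentation gains above are reported in mean IoU (intersection over union, averaged across classes). As a reminder of what that metric measures, here is a minimal sketch in plain Python. The function name `mean_iou`, the flat label lists, and the choice to skip classes absent from both prediction and ground truth are assumptions made for illustration; they are not taken from the paper, and benchmark protocols typically accumulate per-class intersections and unions over an entire dataset before averaging.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes (illustrative sketch).

    pred, target: equal-length sequences of integer class labels, one per pixel.
    Classes that appear in neither sequence are skipped (an assumption here).
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    ious = []
    for c in range(num_classes):
        # Pixels where both prediction and ground truth are class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either prediction or ground truth is class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)

    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six pixels, three classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
```

In this toy case the per-class IoUs are 1/3, 2/3, and 1/2, so `score` is 0.5.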
