SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding

Abstract

The landscape of publicly available vision foundation models (VFMs), such as CLIP and the Segment Anything Model (SAM), is expanding rapidly. VFMs are endowed with distinct capabilities stemming from their pre-training objectives. For instance, CLIP excels in semantic understanding, while SAM specializes in spatial understanding for segmentation. In this work, we introduce a simple recipe to efficiently merge VFMs into a unified model that absorbs their expertise. Our method integrates multi-task learning, continual learning techniques, and teacher-student distillation. This strategy entails significantly less computational cost than traditional multi-task training from scratch. Additionally, it requires only a small fraction of the pre-training datasets that were initially used to train the individual models. By applying our method to SAM and CLIP, we derive SAM-CLIP: a unified model that combines the strengths of SAM and CLIP in a single backbone, making it suitable for edge device applications. We show that SAM-CLIP learns richer visual representations, equipped with both localization and semantic features, suitable for a broad range of vision tasks. SAM-CLIP obtains improved performance on several head probing tasks when compared with SAM and CLIP. We further show that SAM-CLIP not only retains the foundational strengths of its precursor models but also introduces synergistic functionalities, most notably in zero-shot semantic segmentation, where SAM-CLIP establishes new state-of-the-art results on 5 benchmarks. It outperforms previous models that are specifically designed for this task by a large margin, including +6.8% and +5.9% mean IoU improvement on the Pascal-VOC and COCO-Stuff datasets, respectively.
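The segmentation gains above are reported in mean IoU (mean intersection-over-union). As a rough illustration of what that metric measures, the following minimal sketch computes it for a single flattened label map. The function name `mean_iou` and the toy labels are hypothetical, and this is not the evaluation code used for the reported benchmarks, which typically accumulate intersections and unions over a whole dataset before averaging.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Average per-class IoU over classes that appear in pred or target.

    Illustrative helper only (hypothetical name, single-image convention).
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    intersection = [0] * num_classes   # pixels where both maps agree on class c
    pred_count = [0] * num_classes     # pixels predicted as class c
    target_count = [0] * num_classes   # pixels labeled as class c

    for p, t in zip(pred, target):
        pred_count[p] += 1
        target_count[t] += 1
        if p == t:
            intersection[p] += 1

    ious = []
    for c in range(num_classes):
        union = pred_count[c] + target_count[c] - intersection[c]
        if union > 0:  # skip classes absent from both maps
            ious.append(intersection[c] / union)

    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six "pixels", three classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
print(f"mean IoU = {score:.3f}")
```

In this toy case the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5.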

