DC-SAM: In-Context Segment Anything in Images and Videos via Dual Consistency

Abstract

Given a single labeled example, in-context segmentation aims to segment the corresponding objects. This setting, known as one-shot segmentation in few-shot learning, probes a segmentation model's generalization ability and has been applied to various vision tasks, including scene understanding and image/video editing. Although recent Segment Anything Models (SAM) have achieved state-of-the-art results in interactive segmentation, they are not directly applicable to in-context segmentation. In this work, we propose Dual Consistency SAM (DC-SAM), a prompt-tuning method that adapts SAM and SAM2 for in-context segmentation of both images and videos. Our key insight is to enhance the features of SAM's prompt encoder by providing high-quality visual prompts. When generating a mask prior, we fuse the SAM features to better align the prompt encoder. We then design a cycle-consistent cross-attention between the fused features and the initial visual prompts. Next, we provide a dual-branch design that uses discriminative positive and negative prompts in the prompt encoder. Furthermore, we design a simple mask-tube training strategy that applies the proposed dual consistency method to mask tubes. Although DC-SAM is primarily designed for images, it extends seamlessly to the video domain with the support of SAM2. Because no in-context segmentation benchmark exists for the video domain, we manually curate and construct the first one from existing video segmentation datasets, named In-Context Video Object Segmentation (IC-VOS), to better assess the in-context capability of a model. Extensive experiments demonstrate that our method achieves 55.5 (+1.4) mIoU on COCO-20i, 73.0 (+1.1) mIoU on PASCAL-5i, and a J&F score of 71.52 on the proposed IC-VOS benchmark. Our source code and benchmark are available at https://github.com/zaplm/DC-SAM.
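For readers unfamiliar with the mIoU figures quoted above, the following is a minimal sketch of how a mean intersection-over-union score can be computed over one-shot episodes. It assumes one common convention in few-shot segmentation: intersections and unions are accumulated per class across episodes, and the per-class IoU values are then averaged. The abstract does not describe the exact evaluation protocol, so the function names (`intersection_and_union`, `mean_iou`) and the episode format are hypothetical and are not taken from the DC-SAM code base.

```python
from collections import defaultdict


def intersection_and_union(pred, target):
    """Count foreground intersection and union pixels of two binary masks.

    Masks are nested lists of 0/1 values with identical shapes.
    """
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    return inter, union


def mean_iou(episodes):
    """Return mIoU in percent.

    episodes: iterable of (class_id, predicted_mask, ground_truth_mask).
    Assumption: IoU is accumulated per class, then averaged over classes.
    """
    inter_sum = defaultdict(int)
    union_sum = defaultdict(int)
    for class_id, pred, target in episodes:
        inter, union = intersection_and_union(pred, target)
        inter_sum[class_id] += inter
        union_sum[class_id] += union

    # Skip classes whose union is empty to avoid dividing by zero.
    per_class = [inter_sum[c] / union_sum[c] for c in union_sum if union_sum[c] > 0]
    if not per_class:
        return 0.0
    return 100.0 * sum(per_class) / len(per_class)


if __name__ == "__main__":
    toy_episodes = [
        (0, [[1, 1], [0, 0]], [[1, 0], [0, 0]]),  # class 0, partial overlap
        (0, [[1, 0], [0, 0]], [[1, 0], [0, 0]]),  # class 0, exact match
        (1, [[0, 0], [1, 1]], [[0, 0], [1, 1]]),  # class 1, exact match
    ]
    print(f"mIoU: {mean_iou(toy_episodes):.2f}")
```

In this toy run, class 0 accumulates 2 intersecting pixels over 3 union pixels and class 1 matches exactly, so the score is the average of 2/3 and 1, expressed in percent. The J in the J&F score is likewise a region IoU, averaged with a contour accuracy term F that this sketch does not cover.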
