Segment Anything Meets Point Tracking

Abstract

The Segment Anything Model (SAM) has established itself as a powerful zero-shot image segmentation model that generates masks from interactive prompts such as points. This paper presents SAM-PT, a method that extends SAM's capability to tracking and segmenting anything in dynamic videos. SAM-PT leverages robust, sparse point selection and propagation techniques for mask generation, and demonstrates that a SAM-based segmentation tracker can achieve strong zero-shot performance on popular video object segmentation benchmarks, including DAVIS, YouTube-VOS, and MOSE. In contrast to traditional object-centric mask propagation strategies, we uniquely use point propagation to exploit local structure information that is agnostic to object semantics. We highlight the merits of point-based tracking through direct evaluation on the zero-shot open-world Unidentified Video Objects (UVO) benchmark. To further enhance our approach, we use K-Medoids clustering for point initialization and track both positive and negative points to clearly distinguish the target object. We also employ multiple mask decoding passes for mask refinement and devise a point re-initialization strategy to improve tracking accuracy. Our code integrates different point trackers and video segmentation benchmarks and will be released at https://github.com/SysCV/sam-pt.
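
To make the point initialization step concrete, the following is a minimal sketch of choosing query points inside a first-frame mask with K-Medoids clustering. The abstract does not give the clustering details, so this assumes plain Euclidean K-Medoids over foreground pixel coordinates, implemented directly in NumPy. The function name `k_medoids_points` and its parameters are hypothetical and are not taken from the SAM-PT code.

```python
import numpy as np


def k_medoids_points(mask, k, n_iter=20, seed=0):
    """Pick k query points inside a binary mask with a basic k-medoids loop.

    Hypothetical helper for illustration only (not the SAM-PT implementation).
    Returns an integer array of shape (k, 2) with (row, col) coordinates.
    Every returned point is an actual foreground pixel of the mask.
    """
    coords = np.argwhere(mask)  # (n, 2) coordinates of foreground pixels
    if len(coords) < k:
        raise ValueError("mask has fewer foreground pixels than k")

    rng = np.random.default_rng(seed)
    medoids = coords[rng.choice(len(coords), size=k, replace=False)]

    for _ in range(n_iter):
        # Assign every foreground pixel to its nearest medoid.
        dists = np.linalg.norm(coords[:, None, :] - medoids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)

        new_medoids = medoids.copy()
        for j in range(k):
            members = coords[labels == j]
            if len(members) == 0:
                continue  # keep the previous medoid if a cluster is empty
            # New medoid: the member with the smallest summed distance
            # to all other members of the same cluster.
            pairwise = np.linalg.norm(
                members[:, None, :] - members[None, :, :], axis=2
            )
            new_medoids[j] = members[pairwise.sum(axis=1).argmin()]

        if np.array_equal(new_medoids, medoids):
            break  # converged: no medoid changed
        medoids = new_medoids

    return medoids


# Toy first-frame mask with two separate foreground regions.
mask = np.zeros((20, 20), dtype=bool)
mask[2:8, 2:8] = True
mask[12:18, 12:18] = True

points = k_medoids_points(mask, k=2)
```

Because medoids are always drawn from the input set, each selected point is guaranteed to lie on the object, which a cluster mean would not guarantee for non-convex masks. The pairwise distance step is quadratic in cluster size, so a real mask would likely need subsampling before clustering.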