Segment Anything Meets Point Tracking
July 3, 2023
Authors: Frano Rajič, Lei Ke, Yu-Wing Tai, Chi-Keung Tang, Martin Danelljan, Fisher Yu
cs.AI
Abstract
The Segment Anything Model (SAM) has established itself as a powerful
zero-shot image segmentation model, employing interactive prompts such as
points to generate masks. This paper presents SAM-PT, a method extending SAM's
capability to tracking and segmenting anything in dynamic videos. SAM-PT
leverages robust and sparse point selection and propagation techniques for mask
generation, demonstrating that a SAM-based segmentation tracker can yield
strong zero-shot performance across popular video object segmentation
benchmarks, including DAVIS, YouTube-VOS, and MOSE. Compared to traditional
object-centric mask propagation strategies, we uniquely use point propagation
to exploit local structure information that is agnostic to object semantics. We
highlight the merits of point-based tracking through direct evaluation on the
zero-shot open-world Unidentified Video Objects (UVO) benchmark. To further
enhance our approach, we utilize K-Medoids clustering for point initialization
and track both positive and negative points to clearly distinguish the target
object. We also employ multiple mask decoding passes for mask refinement and
devise a point re-initialization strategy to improve tracking accuracy. Our
code integrates different point trackers and video segmentation benchmarks and
will be released at https://github.com/SysCV/sam-pt.
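The K-Medoids point-initialization step mentioned in the abstract can be illustrated with a minimal sketch: given a binary first-frame mask of the target, sample its foreground pixel coordinates and pick a small set of spread-out query points whose cluster centers are guaranteed to lie on actual mask pixels. The mask shape, point count, and helper below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def k_medoids(points, k, iters=10, seed=0):
    """Plain K-Medoids: unlike K-Means, each center is an actual input point."""
    rng = np.random.default_rng(seed)
    medoids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assign every point to its nearest medoid.
        d = np.linalg.norm(points[:, None, :] - medoids[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # Re-pick each medoid as the member minimizing total intra-cluster distance.
        for c in range(k):
            members = points[labels == c]
            if len(members) == 0:
                continue
            intra = np.linalg.norm(members[:, None, :] - members[None, :, :], axis=-1)
            medoids[c] = members[intra.sum(axis=1).argmin()]
    return medoids

# Toy first-frame mask: a filled rectangle standing in for the target object.
mask = np.zeros((64, 64), dtype=bool)
mask[16:48, 20:50] = True
fg = np.argwhere(mask).astype(float)   # (row, col) coordinates of mask pixels
query_points = k_medoids(fg, k=8)      # 8 positive query points on the object
print(query_points.shape)              # (8, 2)
```

Because medoids are constrained to be members of the sampled foreground set, every query point is guaranteed to fall inside the object mask, which is why the paper favors K-Medoids over mean-based clustering for seeding the point tracker.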