Segment Anything Meets Point Tracking

July 3, 2023
Authors: Frano Rajič, Lei Ke, Yu-Wing Tai, Chi-Keung Tang, Martin Danelljan, Fisher Yu
cs.AI

Abstract

The Segment Anything Model (SAM) has established itself as a powerful zero-shot image segmentation model, employing interactive prompts such as points to generate masks. This paper presents SAM-PT, a method extending SAM's capability to tracking and segmenting anything in dynamic videos. SAM-PT leverages robust and sparse point selection and propagation techniques for mask generation, demonstrating that a SAM-based segmentation tracker can yield strong zero-shot performance across popular video object segmentation benchmarks, including DAVIS, YouTube-VOS, and MOSE. Compared to traditional object-centric mask propagation strategies, we uniquely use point propagation to exploit local structure information that is agnostic to object semantics. We highlight the merits of point-based tracking through direct evaluation on the zero-shot open-world Unidentified Video Objects (UVO) benchmark. To further enhance our approach, we utilize K-Medoids clustering for point initialization and track both positive and negative points to clearly distinguish the target object. We also employ multiple mask decoding passes for mask refinement and devise a point re-initialization strategy to improve tracking accuracy. Our code integrates different point trackers and video segmentation benchmarks and will be released at https://github.com/SysCV/sam-pt.
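The point initialization step mentioned in the abstract lends itself to a short illustration. The sketch below is a minimal, self-contained Python example, not the authors' released code: the function name, the defaults, and the plain alternating K-Medoids loop are assumptions made for clarity. One natural reason to prefer medoids over centroids here is that medoids are actual mask pixels, so every selected query point is guaranteed to lie on the target object.

```python
import numpy as np

def kmedoids_query_points(mask: np.ndarray, k: int = 8, iters: int = 10, seed: int = 0):
    """Pick k positive query points inside a binary object mask via K-Medoids.

    Illustrative sketch only: a plain alternating assign/update loop,
    not the optimized implementation used in SAM-PT.
    """
    rng = np.random.default_rng(seed)
    pts = np.argwhere(mask)                      # (N, 2) foreground (row, col) coords
    if len(pts) > 2000:                          # cap the O(N^2) medoid update for the demo
        pts = pts[rng.choice(len(pts), size=2000, replace=False)]
    medoids = pts[rng.choice(len(pts), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: attach every foreground pixel to its nearest medoid.
        dists = np.linalg.norm(pts[:, None, :] - medoids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Update step: each medoid becomes the cluster member with the
        # smallest total distance to the rest of its cluster.
        for c in range(k):
            members = pts[labels == c]
            if len(members) == 0:
                continue
            within = np.linalg.norm(members[:, None, :] - members[None, :, :], axis=-1)
            medoids[c] = members[within.sum(axis=1).argmin()]
    return medoids                               # (k, 2) query points, all on the object

# Toy usage: eight query points spread over a rectangular mask.
mask = np.zeros((64, 64), dtype=bool)
mask[20:44, 10:50] = True
print(kmedoids_query_points(mask, k=8))
```

In the full method these points would be handed to a point tracker, and the propagated positive and negative points would prompt SAM frame by frame; the clustering shown here covers only the first-frame initialization.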