SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree

Abstract

The Segment Anything Model 2 (SAM 2) has emerged as a powerful foundation model for object segmentation in both images and videos, paving the way for various downstream video applications. The key design of SAM 2 for video segmentation is its memory module, which draws on object-aware memories from previous frames to predict the current frame. However, its greedy-selection memory design suffers from the "error accumulation" problem: an erroneous or missed mask cascades into the segmentation of subsequent frames, which limits the performance of SAM 2 on complex long-term videos. To address this, we introduce SAM2Long, an improved training-free video object segmentation strategy that accounts for the segmentation uncertainty within each frame and chooses the video-level optimal result from multiple segmentation pathways in a constrained tree search manner. In practice, we maintain a fixed number of segmentation pathways throughout the video. For each frame, multiple masks are proposed based on the existing pathways, creating various candidate branches. We then select the same fixed number of branches with the highest cumulative scores as the new pathways for the next frame. After the final frame is processed, the pathway with the highest cumulative score is chosen as the final segmentation result. Thanks to its heuristic search design, SAM2Long is robust to occlusions and object reappearances, and can effectively segment and track objects in complex long-term videos. Notably, SAM2Long achieves an average improvement of 3.0 points across all 24 head-to-head comparisons, with gains of up to 5.3 points in J&F on long-term video object segmentation benchmarks such as SA-V and LVOS. The code is released at https://github.com/Mark12Ding/SAM2Long.
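The pathway selection described in the abstract can be illustrated with a minimal sketch. It is not the released implementation: the `Pathway` class, the `propose` callback, and the `num_pathways` parameter are hypothetical names introduced here, strings stand in for masks, and the per-candidate scores are toy numbers, since the abstract does not say how scores are computed. The sketch only shows the bookkeeping: branch every surviving pathway, keep a fixed number of branches by cumulative score, and return the best pathway after the last frame.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Sequence, Tuple


@dataclass
class Pathway:
    """One segmentation hypothesis: a mask per processed frame plus its cumulative score."""
    masks: List[str] = field(default_factory=list)  # strings stand in for real masks
    score: float = 0.0


# Hypothetical callback: given a frame and a pathway, return candidate (mask, score) pairs.
Proposer = Callable[[object, Pathway], Sequence[Tuple[str, float]]]


def constrained_tree_search(frames: Iterable[object],
                            propose: Proposer,
                            num_pathways: int = 3) -> Pathway:
    """Keep a fixed number of pathways per frame; return the best one at the end."""
    pathways = [Pathway()]
    for frame in frames:
        # Expand every surviving pathway into candidate branches.
        branches = [
            Pathway(path.masks + [mask], path.score + score)
            for path in pathways
            for mask, score in propose(frame, path)
        ]
        if not branches:
            continue  # no candidates for this frame: keep the current pathways
        # Prune: keep only the branches with the highest cumulative scores.
        branches.sort(key=lambda p: p.score, reverse=True)
        pathways = branches[:num_pathways]
    return max(pathways, key=lambda p: p.score)


def toy_propose(frame: object, path: Pathway) -> Sequence[Tuple[str, float]]:
    """Toy scores: mask "A" looks best on the first frame but leads to a poor second frame."""
    if not path.masks:
        return [("A", 0.6), ("B", 0.5)]
    return [("A", 0.1)] if path.masks[-1] == "A" else [("B", 0.9)]


if __name__ == "__main__":
    greedy = constrained_tree_search(range(2), toy_propose, num_pathways=1)
    tree = constrained_tree_search(range(2), toy_propose, num_pathways=2)
    print("greedy:", greedy.masks)
    print("tree search:", tree.masks)
```

With `num_pathways=1` the search reduces to greedy selection and commits to the mask that looks best on the first frame. With two pathways the lower-scored first-frame candidate survives and wins on cumulative score, which is the error-accumulation scenario the abstract describes.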

