SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree

Abstract

The Segment Anything Model 2 (SAM 2) has emerged as a powerful foundation model for object segmentation in both images and videos, paving the way for various downstream video applications. The key design element of SAM 2 for video segmentation is its memory module, which supplies object-aware memories from previous frames to guide the prediction for the current frame. However, its greedy-selection memory design suffers from the "error accumulation" problem: an erroneous or missed mask cascades into the segmentation of subsequent frames, which limits the performance of SAM 2 on complex long-term videos. To address this, we introduce SAM2Long, an improved training-free video object segmentation strategy that accounts for the segmentation uncertainty within each frame and chooses the video-level optimal result from multiple segmentation pathways in a constrained tree search manner. In practice, we maintain a fixed number of segmentation pathways throughout the video. For each frame, multiple masks are proposed based on the existing pathways, creating various candidate branches. We then select the same fixed number of branches with the highest cumulative scores as the new pathways for the next frame. After processing the final frame, the pathway with the highest cumulative score is chosen as the final segmentation result. Benefiting from its heuristic search design, SAM2Long is robust to occlusions and object reappearances, and can effectively segment and track objects in complex long-term videos. Notably, SAM2Long achieves an average improvement of 3.0 points across all 24 head-to-head comparisons, with gains of up to 5.3 points in J&F on long-term video object segmentation benchmarks such as SA-V and LVOS. The code is released at https://github.com/Mark12Ding/SAM2Long.
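
The pathway selection described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the names `constrained_tree_search`, `propose`, and `toy_propose` are hypothetical, masks are stood in for by string labels, and the sketch assumes that a pathway's cumulative score is the plain sum of its per-frame scores. It shows only the control flow: expand every pathway into candidate branches, keep a fixed number with the highest cumulative scores, and return the best pathway after the last frame.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-ins: a "mask" is just a label, and a candidate pairs it
# with a per-frame score. A real system would use mask tensors and model scores.
Candidate = Tuple[str, float]
Pathway = Tuple[List[str], float]  # (masks chosen so far, cumulative score)


def constrained_tree_search(
    num_frames: int,
    propose: Callable[[int, Sequence[str]], List[Candidate]],
    num_pathways: int = 3,
) -> Pathway:
    """Keep the `num_pathways` best pathways per frame; return the best one."""
    pathways: List[Pathway] = [([], 0.0)]
    for t in range(num_frames):
        branches: List[Pathway] = []
        for history, total in pathways:
            # Each existing pathway proposes several candidate masks.
            for mask, score in propose(t, history):
                # Assumption: cumulative score is the sum of per-frame scores.
                branches.append((history + [mask], total + score))
        # Keep the fixed number of branches with the highest cumulative scores.
        branches.sort(key=lambda b: b[1], reverse=True)
        pathways = branches[:num_pathways]
    # After the final frame, pick the pathway with the highest cumulative score.
    return max(pathways, key=lambda p: p[1])


def toy_propose(t: int, history: Sequence[str]) -> List[Candidate]:
    """Toy proposals: mask "a" looks best at frame 0 but leads to poor frames."""
    if t == 0:
        return [("a", 0.75), ("b", 0.5)]
    if history[-1] == "a":
        return [("a", 0.125)]
    return [("b", 1.0)]


# One pathway behaves like greedy selection and locks in the early mistake.
print(constrained_tree_search(2, toy_propose, num_pathways=1))  # (['a', 'a'], 0.875)
# Two pathways keep the alternative alive long enough for it to win.
print(constrained_tree_search(2, toy_propose, num_pathways=2))  # (['b', 'b'], 1.5)
```

In the toy run, a single pathway reproduces the error accumulation the abstract attributes to greedy selection, while keeping two pathways lets the initially lower-scoring branch overtake it once later frames are scored.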
