SAM2Matting: General-Purpose Image and Video Matting

Abstract

Despite impressive advances in image matting, video matting remains challenging because of the inherent gap between high-level tracking, which requires frame-wise understanding, and low-level matting, which focuses on extremely fine-grained details. Existing methods try to bridge this gap with expensive, narrowly scoped video matting datasets, which can limit out-of-domain generalization and compromise tracking robustness. We rethink this paradigm with SAM2Matting, a tracker-to-matting framework that advances VOS trackers to high-fidelity video matting. Specifically, it decouples the task by augmenting a foundational tracker (e.g., SAM2, SAM3) with a region-proposal bridge and dedicated matting heads, so that the uncompromised tracker handles temporal consistency while the matting components resolve fine-grained details. Notably, despite being trained only on images, SAM2Matting establishes new state-of-the-art performance on video matting, supports diverse prompt types, maintains strong temporal consistency, and generalizes robustly across both human-centric and in-the-wild scenarios.
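To make the decoupling concrete, the following is a minimal sketch of a tracker-to-matting flow on a single frame. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how the region-proposal bridge or the matting heads work. Here the bridge is assumed to flag a band of pixels around the boundary of the tracker's binary mask, and the matting head is a hypothetical callable that returns an alpha value for a flagged pixel. The names `propose_region`, `refine_frame`, and `matting_head` are invented for this example.

```python
from typing import Callable, List

Mask = List[List[int]]      # binary tracker mask, 1 = foreground
Alpha = List[List[float]]   # alpha matte with values in [0, 1]


def propose_region(mask: Mask, radius: int = 1) -> Mask:
    """Assumed bridge: flag pixels whose neighbourhood mixes 0s and 1s."""
    h, w = len(mask), len(mask[0])
    region = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                mask[j][i]
                for j in range(max(0, y - radius), min(h, y + radius + 1))
                for i in range(max(0, x - radius), min(w, x + radius + 1))
            ]
            # A mixed window means the pixel lies near the mask boundary.
            if min(window) != max(window):
                region[y][x] = 1
    return region


def refine_frame(
    mask: Mask,
    matting_head: Callable[[int, int], float],  # hypothetical stand-in
    radius: int = 1,
) -> Alpha:
    """Keep the tracker's decision away from the boundary; refine near it."""
    region = propose_region(mask, radius)
    alpha: Alpha = []
    for y, row in enumerate(mask):
        out_row = []
        for x, value in enumerate(row):
            if region[y][x]:
                # Only flagged pixels are passed to the matting head.
                out_row.append(min(1.0, max(0.0, matting_head(y, x))))
            else:
                out_row.append(float(value))
        alpha.append(out_row)
    return alpha


if __name__ == "__main__":
    # 7x7 frame with a 3x3 foreground block in the centre.
    frame_mask = [[1 if 2 <= y <= 4 and 2 <= x <= 4 else 0 for x in range(7)]
                  for y in range(7)]
    matte = refine_frame(frame_mask, lambda y, x: 0.5)
    for row in matte:
        print(" ".join(f"{a:.1f}" for a in row))
```

In this sketch the tracker output is never modified, only refined near its boundary, which mirrors the idea that the tracker stays responsible for where the object is across frames while the matting components decide the fine-grained alpha values.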