Segment Any Motion in Videos

Abstract

Moving object segmentation is a crucial task for achieving a high-level understanding of visual scenes and has numerous downstream applications. Humans can effortlessly segment moving objects in videos. Previous work has largely relied on optical flow to provide motion cues; however, this approach often yields imperfect predictions because of partial motion, complex deformations, motion blur, and background distractions. We propose a novel approach for moving object segmentation that combines long-range trajectory motion cues with DINO-based semantic features and leverages SAM2 for pixel-level mask densification through an iterative prompting strategy. Our model employs Spatio-Temporal Trajectory Attention and Motion-Semantic Decoupled Embedding to prioritize motion while integrating semantic support. Extensive testing on diverse datasets demonstrates state-of-the-art performance, with particularly strong results in challenging scenarios and in fine-grained segmentation of multiple objects. Our code is available at https://motion-seg.github.io/.
