Segment Any Motion in Videos
March 28, 2025
Authors: Nan Huang, Wenzhao Zheng, Chenfeng Xu, Kurt Keutzer, Shanghang Zhang, Angjoo Kanazawa, Qianqian Wang
cs.AI
Abstract
Moving object segmentation is a crucial task for achieving a high-level
understanding of visual scenes and has numerous downstream applications. Humans
can effortlessly segment moving objects in videos. Previous work has largely
relied on optical flow to provide motion cues; however, this approach often
results in imperfect predictions due to challenges such as partial motion,
complex deformations, motion blur and background distractions. We propose a
novel approach for moving object segmentation that combines long-range
trajectory motion cues with DINO-based semantic features and leverages SAM2 for
pixel-level mask densification through an iterative prompting strategy. Our
model employs Spatio-Temporal Trajectory Attention and Motion-Semantic
Decoupled Embedding to prioritize motion while integrating semantic support.
Extensive testing on diverse datasets demonstrates state-of-the-art
performance, excelling in challenging scenarios and fine-grained segmentation
of multiple objects. Our code is available at https://motion-seg.github.io/.
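
As a rough illustration of the pipeline the abstract describes, the sketch below classifies long-range point trajectories as moving or static using attention over fused motion and DINO semantic features, then uses points from the moving trajectories as prompts for a SAM-style predictor to densify them into pixel-level masks. All module names, tensor shapes, and the predictor interface here are illustrative assumptions, not the authors' implementation; the paper's spatio-temporal attention and iterative prompting strategy are simplified to a single attention stage and a single prompt pass.

```python
# Hypothetical sketch of the described pipeline (not the released code).
import torch
import torch.nn as nn


class TrajectoryMotionClassifier(nn.Module):
    """Per-trajectory moving/static classification from motion + semantic cues."""

    def __init__(self, motion_dim, semantic_dim, d_model=256, num_layers=4, num_heads=8):
        super().__init__()
        # Separate projections loosely mirror the idea of keeping motion and
        # semantic cues in decoupled embeddings before fusing them.
        self.motion_proj = nn.Linear(motion_dim, d_model)
        self.semantic_proj = nn.Linear(semantic_dim, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=num_heads, batch_first=True
        )
        # Attention over trajectory tokens; temporal context is assumed to be
        # pooled into the per-trajectory motion descriptors beforehand.
        self.trajectory_attention = nn.TransformerEncoder(encoder_layer, num_layers)
        self.head = nn.Linear(d_model, 1)  # one moving/static logit per trajectory

    def forward(self, motion_feats, semantic_feats):
        # motion_feats:   (B, N, motion_dim)   long-range trajectory descriptors
        # semantic_feats: (B, N, semantic_dim) DINO features sampled at track points
        tokens = self.motion_proj(motion_feats) + self.semantic_proj(semantic_feats)
        tokens = self.trajectory_attention(tokens)
        return self.head(tokens).squeeze(-1)  # (B, N) logits


def densify_with_sam(predictor, image, track_points, moving_logits, threshold=0.0):
    """Turn moving trajectory points into a dense mask via point prompts.

    `predictor` is assumed to expose a SAM-style `set_image` / `predict`
    interface taking point coordinates and labels; adapt this call to the
    actual SAM2 API. `track_points` is (N, 2) pixel coordinates and
    `moving_logits` is (N,) for a single frame.
    """
    moving = moving_logits > threshold
    coords = track_points[moving]                        # (M, 2) positive prompt points
    labels = torch.ones(len(coords), dtype=torch.int)    # all prompts are foreground
    predictor.set_image(image)
    masks, _, _ = predictor.predict(
        point_coords=coords.cpu().numpy(), point_labels=labels.numpy()
    )
    return masks
```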