Segment Any Motion in Videos

Abstract

Moving object segmentation is a crucial task for achieving a high-level understanding of visual scenes and has numerous downstream applications. Humans can effortlessly segment moving objects in videos. Previous work has largely relied on optical flow to provide motion cues; however, this approach often yields imperfect predictions because of challenges such as partial motion, complex deformations, motion blur, and background distractions. We propose a novel approach for moving object segmentation that combines long-range trajectory motion cues with DINO-based semantic features and leverages SAM2 for pixel-level mask densification through an iterative prompting strategy. Our model employs Spatio-Temporal Trajectory Attention and Motion-Semantic Decoupled Embedding to prioritize motion while integrating semantic support. Extensive testing on diverse datasets demonstrates state-of-the-art performance, with particularly strong results in challenging scenarios and in fine-grained segmentation of multiple objects. Our code is available at https://motion-seg.github.io/.
