움직임, 기하학 및 의미 적응을 통한 복잡한 비선형 시각 객체 추적을 위한 Segment Anything

초록

전통적인 시각 객체 추적(VOT) 방법은 일반적으로 작업별 지도 학습(supervised training)에 의존하기 때문에, 탐지되지 않은 객체와 방해 요소, 폐색, 비선형 운동이 포함된 까다로운 시나리오에 대한 일반화 능력이 제한적이다. 최근의 SAM 2로 대표되는 비전 기초 모델(vision foundation model)은 대규모 사전 학습을 통해 강력한 비디오 이해 사전 지식을 습득하며, 보다 강건하고 일반화 가능한 추적기를 구축할 수 있는 유망한 기반을 제공한다. 그러나 SAM 2를 VOT에 직접 적용하는 것은 여전히 최적이 아니다. 왜냐하면 SAM 2는 대상의 운동 역학(motion dynamics)을 명시적으로 모델링하지 않으며, 신뢰할 수 있는 추적에 필수적인 프레임 간 기하학적 및 의미론적 일관성(geometric and semantic consistency)을 강제하지 않기 때문이다. 이 문제를 해결하기 위해, 우리는 SAMOSA라는 새로운 추적 프레임워크를 제안한다. 이 프레임워크는 운동, 기하학 및 의미론적 단서를 명시적으로 활용하여 SAM 2를 복잡한 VOT 시나리오에 적응시킨다. 구체적으로, 우리는 대상 역학을 모델링하고 마스크 선택 및 메모리 필터링을 안내하기 위해 경량화된 비선형 운동 예측기를 도입한다. 또한 의미론적 단서를 활용하여 대상 이동을 탐지하고 추적 실패로부터 복구하며, 기하학적 단서는 구조적 제약 조건으로 통합하여 추적 안정성을 향상시킨다. 이러한 방식으로 SAMOSA는 SAM 2의 암시적 비디오 이해 사전 지식과 명시적 추적 지향 모델링 간의 격차를 해소한다. 광범위한 실험 결과, SAMOSA는 일반 벤치마크에서 최첨단 SAM 2 기반 접근법보다 일관되게 우수한 성능을 보이며, 지도 학습 VOT 방법보다 더 강력한 일반화 능력을 입증하고, 복잡한 비선형 운동 시나리오의 전형인 안티-UAV 데이터셋에서 상당한 성능 향상을 달성함을 보여준다. 우리의 코드는 https://github.com/DurYi/SAMOSA에서 확인할 수 있다.

English

Traditional visual object tracking (VOT) methods typically rely on task-specific supervised training, limiting their generalization to unseen objects and challenging scenarios with distractors, occlusion, and nonlinear motion. Recent vision foundation models, exemplified by SAM 2, learn strong video understanding priors from large-scale pretraining and offer a promising foundation for building more robust and generalizable trackers. However, directly applying SAM 2 to VOT remains suboptimal, as it does not explicitly model target motion dynamics or enforce geometric and semantic consistency across frames, both of which are essential for reliable tracking. To address this issue, we propose SAMOSA, a new tracking framework that adapts SAM 2 to complex VOT scenarios by explicitly leveraging motion, geometry, and semantic cues. Specifically, we introduce a lightweight nonlinear motion predictor to model target dynamics and guide mask selection as well as memory filtering. We further exploit semantic cues to detect target shifts and recover from tracking failures, while geometric cues are incorporated as structural constraints to improve tracking stability. In this way, SAMOSA bridges the gap between the implicit video understanding prior of SAM 2 and explicit tracking-oriented modeling. Extensive experiments show that SAMOSA consistently outperforms state-of-the-art SAM 2--based approaches on general benchmarks, demonstrates stronger generalization than supervised VOT methods, and achieves substantial gains on anti-UAV datasets, which typify complex nonlinear motion scenarios. Our code is available at https://github.com/DurYi/SAMOSA.