結合運動、幾何與語義適應的分割一切：應用於複雜非線性視覺物體追蹤

摘要

傳統視覺目標追蹤（VOT）方法通常依賴於任務特定的監督式訓練，限制了其在未見過目標及具有干擾物、遮蔽與非線性運動等挑戰場景中的泛化能力。近期以 SAM 2 為代表的視覺基礎模型，透過大規模預訓練學習到強大的影片理解先驗知識，為建構更穩健且具泛化能力的追蹤器提供了有前景的基礎。然而，直接將 SAM 2 應用於 VOT 仍未達到最佳效果，因為它並未明確建模目標的運動動態，也未能強制跨影格間的幾何與語意一致性——這兩者對於可靠的追蹤至關重要。為了解決此問題，我們提出 SAMOSA，一個新的追蹤框架，透過明確利用運動、幾何與語意線索，將 SAM 2 適應於複雜的 VOT 場景。具體來說，我們引入一個輕量級的非線性運動預測器來建模目標動態，並引導遮罩選取及記憶體過濾。我們進一步利用語意線索來偵測目標位移並從追蹤失敗中恢復，同時將幾何線索作為結構約束條件融入，以提升追蹤穩定性。透過這種方式，SAMOSA 彌補了 SAM 2 隱含的影片理解先驗與明確的追蹤導向建模之間的差距。大量實驗顯示，SAMOSA 在通用基準測試上持續優於基於 SAM 2 的最新方法，展現出比監督式 VOT 方法更強的泛化能力，並在典型的複雜非線性運動場景——反無人機資料集上取得顯著增益。我們的程式碼已開源於 https://github.com/DurYi/SAMOSA。

English

Traditional visual object tracking (VOT) methods typically rely on task-specific supervised training, limiting their generalization to unseen objects and challenging scenarios with distractors, occlusion, and nonlinear motion. Recent vision foundation models, exemplified by SAM 2, learn strong video understanding priors from large-scale pretraining and offer a promising foundation for building more robust and generalizable trackers. However, directly applying SAM 2 to VOT remains suboptimal, as it does not explicitly model target motion dynamics or enforce geometric and semantic consistency across frames, both of which are essential for reliable tracking. To address this issue, we propose SAMOSA, a new tracking framework that adapts SAM 2 to complex VOT scenarios by explicitly leveraging motion, geometry, and semantic cues. Specifically, we introduce a lightweight nonlinear motion predictor to model target dynamics and guide mask selection as well as memory filtering. We further exploit semantic cues to detect target shifts and recover from tracking failures, while geometric cues are incorporated as structural constraints to improve tracking stability. In this way, SAMOSA bridges the gap between the implicit video understanding prior of SAM 2 and explicit tracking-oriented modeling. Extensive experiments show that SAMOSA consistently outperforms state-of-the-art SAM 2--based approaches on general benchmarks, demonstrates stronger generalization than supervised VOT methods, and achieves substantial gains on anti-UAV datasets, which typify complex nonlinear motion scenarios. Our code is available at https://github.com/DurYi/SAMOSA.