Segment Anything with Motion, Geometry, and Semantic Adaptation for Complex Nonlinear Visual Object Tracking

Abstract

Traditional visual object tracking (VOT) methods typically rely on task-specific supervised training, limiting their generalization to unseen objects and challenging scenarios with distractors, occlusion, and nonlinear motion. Recent vision foundation models, exemplified by SAM 2, learn strong video understanding priors from large-scale pretraining and offer a promising foundation for building more robust and generalizable trackers. However, directly applying SAM 2 to VOT remains suboptimal, as it does not explicitly model target motion dynamics or enforce geometric and semantic consistency across frames, both of which are essential for reliable tracking. To address this issue, we propose SAMOSA, a new tracking framework that adapts SAM 2 to complex VOT scenarios by explicitly leveraging motion, geometry, and semantic cues. Specifically, we introduce a lightweight nonlinear motion predictor to model target dynamics and guide mask selection as well as memory filtering. We further exploit semantic cues to detect target shifts and recover from tracking failures, while geometric cues are incorporated as structural constraints to improve tracking stability. In this way, SAMOSA bridges the gap between the implicit video understanding prior of SAM 2 and explicit tracking-oriented modeling. Extensive experiments show that SAMOSA consistently outperforms state-of-the-art SAM 2-based approaches on general benchmarks, demonstrates stronger generalization than supervised VOT methods, and achieves substantial gains on anti-UAV datasets, which typify complex nonlinear motion scenarios. Our code is available at https://github.com/DurYi/SAMOSA.
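
To make the idea of motion-guided mask selection concrete, the following is a minimal illustrative sketch and not the SAMOSA implementation. The abstract does not specify the design of the motion predictor, so a simple constant-acceleration extrapolation of the box center is used here as a stand-in for a nonlinear predictor. The names `predict_box`, `mask_to_box`, `box_iou`, `select_mask`, and the weight `alpha` are hypothetical and introduced only for illustration. The sketch assumes the segmentation model returns several candidate masks per frame, each with a confidence score, and picks the candidate that best balances that score against agreement with the predicted box.

```python
import numpy as np


def predict_box(history):
    """Extrapolate the next box (cx, cy, w, h) from past boxes.

    Assumption: constant-acceleration extrapolation of the center,
    with the size copied from the most recent box.
    """
    h = np.asarray(history, dtype=float)
    if len(h) >= 3:
        center = 3 * h[-1, :2] - 3 * h[-2, :2] + h[-3, :2]
    elif len(h) == 2:
        center = 2 * h[-1, :2] - h[-2, :2]
    else:
        center = h[-1, :2]
    return np.concatenate([center, h[-1, 2:]])


def mask_to_box(mask):
    """Tight bounding box (cx, cy, w, h) of a boolean mask, or None if empty."""
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None
    x0, x1 = xs.min(), xs.max() + 1
    y0, y1 = ys.min(), ys.max() + 1
    return np.array([(x0 + x1) / 2, (y0 + y1) / 2, x1 - x0, y1 - y0], dtype=float)


def box_iou(a, b):
    """Intersection over union of two (cx, cy, w, h) boxes."""
    ax0, ay0, ax1, ay1 = a[0] - a[2] / 2, a[1] - a[3] / 2, a[0] + a[2] / 2, a[1] + a[3] / 2
    bx0, by0, bx1, by1 = b[0] - b[2] / 2, b[1] - b[3] / 2, b[0] + b[2] / 2, b[1] + b[3] / 2
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def select_mask(masks, scores, history, alpha=0.5):
    """Index of the candidate with the best blend of motion agreement and model score."""
    predicted = predict_box(history)
    combined = []
    for mask, score in zip(masks, scores):
        box = mask_to_box(mask)
        iou = box_iou(box, predicted) if box is not None else 0.0
        combined.append(alpha * iou + (1 - alpha) * score)
    return int(np.argmax(combined))
```

With `alpha = 0`, the selection reduces to taking the highest-scoring mask. Larger values of `alpha` let the motion prediction override a confident but spatially implausible candidate, such as a nearby distractor.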