Segment Anything with Motion, Geometry, and Semantic Adaptation for Complex Nonlinear Visual Object Tracking

Abstract

Traditional visual object tracking (VOT) methods typically rely on task-specific supervised training, limiting their generalization to unseen objects and challenging scenarios with distractors, occlusion, and nonlinear motion. Recent vision foundation models, exemplified by SAM 2, learn strong video understanding priors from large-scale pretraining and offer a promising foundation for building more robust and generalizable trackers. However, directly applying SAM 2 to VOT remains suboptimal, as it does not explicitly model target motion dynamics or enforce geometric and semantic consistency across frames, both of which are essential for reliable tracking. To address this issue, we propose SAMOSA, a new tracking framework that adapts SAM 2 to complex VOT scenarios by explicitly leveraging motion, geometry, and semantic cues. Specifically, we introduce a lightweight nonlinear motion predictor to model target dynamics and guide mask selection as well as memory filtering. We further exploit semantic cues to detect target shifts and recover from tracking failures, while geometric cues are incorporated as structural constraints to improve tracking stability. In this way, SAMOSA bridges the gap between the implicit video understanding prior of SAM 2 and explicit tracking-oriented modeling. Extensive experiments show that SAMOSA consistently outperforms state-of-the-art SAM 2-based approaches on general benchmarks, demonstrates stronger generalization than supervised VOT methods, and achieves substantial gains on anti-UAV datasets, which typify complex nonlinear motion scenarios. Our code is available at https://github.com/DurYi/SAMOSA.
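
To make the idea of motion-guided mask selection concrete, the following is a minimal illustrative sketch and not the SAMOSA implementation. It assumes that the segmenter returns several candidate boxes per frame, each with a confidence score, and it substitutes a simple constant-acceleration extrapolation for the paper's nonlinear motion predictor, whose actual form is not described in the abstract. The function names (`extrapolate_box`, `box_iou`, `select_candidate`) and the `motion_weight` parameter are hypothetical.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def extrapolate_box(history: Sequence[Box]) -> Box:
    """Predict the next box from past boxes.

    Assumption: a constant-acceleration (quadratic) extrapolation stands in
    for a learned nonlinear motion predictor.
    """
    if not history:
        raise ValueError("history must contain at least one box")
    if len(history) == 1:
        return tuple(history[-1])
    if len(history) == 2:
        # Constant velocity: x_{t+1} = 2 x_t - x_{t-1}
        return tuple(2 * b - a for a, b in zip(history[-2], history[-1]))
    p0, p1, p2 = history[-3], history[-2], history[-1]
    # Constant acceleration: x_{t+1} = 3 x_t - 3 x_{t-1} + x_{t-2}
    return tuple(3 * c - 3 * b + a for a, b, c in zip(p0, p1, p2))


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def select_candidate(
    candidate_boxes: Sequence[Box],
    mask_scores: Sequence[float],
    history: Sequence[Box],
    motion_weight: float = 0.25,  # hypothetical blending weight
) -> int:
    """Return the index of the candidate with the best blended score.

    The blended score mixes the segmenter's own confidence with the IoU
    between the candidate box and the motion-predicted box.
    """
    if len(candidate_boxes) != len(mask_scores) or not candidate_boxes:
        raise ValueError("need one score per candidate and at least one candidate")
    predicted = extrapolate_box(history)
    blended = [
        (1.0 - motion_weight) * score + motion_weight * box_iou(box, predicted)
        for box, score in zip(candidate_boxes, mask_scores)
    ]
    return max(range(len(blended)), key=blended.__getitem__)


# Usage: an accelerating target and a slightly higher-scoring distractor.
history = [(0, 0, 10, 10), (1, 0, 11, 10), (4, 0, 14, 10)]
candidates = [(40, 0, 50, 10), (9, 0, 19, 10)]  # distractor, true target
scores = [0.85, 0.80]
chosen = select_candidate(candidates, scores, history)
```

In this toy case the distractor has the higher raw confidence, but the motion term favors the candidate that agrees with the extrapolated trajectory. Under the same assumptions, a motion-consistency score of this kind could also be thresholded to decide which frames enter the memory bank, which is one plausible reading of the memory filtering mentioned above.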