SAM2S: Segment Anything in Surgical Videos via Semantic Long-term Tracking
November 20, 2025
Authors: Haofeng Liu, Ziyue Wang, Sudhanshu Mishra, Mingqi Gao, Guanyi Qin, Chang Han Low, Alex Y. W. Kong, Yueming Jin
cs.AI
Abstract
Surgical video segmentation is crucial for computer-assisted surgery, enabling precise localization and tracking of instruments and tissues. Interactive Video Object Segmentation (iVOS) models such as the Segment Anything Model 2 (SAM2) offer prompt-based flexibility beyond methods restricted to predefined categories, but they face challenges in surgical scenarios due to the domain gap and limited long-term tracking capability. To address these limitations, we construct SA-SV, the largest surgical iVOS benchmark, with instance-level spatio-temporal annotations (masklets) spanning eight procedure types (61k frames, 1.6k masklets), enabling comprehensive development and evaluation of long-term tracking and zero-shot generalization. Building on SA-SV, we propose SAM2S, a foundation model that enhances SAM2 for surgical iVOS through: (1) DiveMem, a trainable diverse memory mechanism for robust long-term tracking; (2) temporal semantic learning for instrument understanding; and (3) ambiguity-resilient learning to mitigate annotation inconsistencies across multi-source datasets. Extensive experiments demonstrate that fine-tuning on SA-SV yields substantial gains, with fine-tuned SAM2 improving by 12.99 average J&F over vanilla SAM2. SAM2S further advances performance to 80.42 average J&F, surpassing vanilla and fine-tuned SAM2 by 17.10 and 4.11 points respectively, while maintaining real-time inference at 68 FPS and strong zero-shot generalization. Code and dataset will be released at https://jinlab-imvr.github.io/SAM2S.
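For context, the sketch below illustrates the prompt-based iVOS workflow that SAM2S builds on, using the publicly available SAM2 video predictor API (prompt an object with clicks on one frame, then propagate a masklet through the clip). The config, checkpoint, and frame-directory paths are placeholders, and this is a minimal illustration of the baseline SAM2 usage, not the released SAM2S code.

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

# Placeholder config/checkpoint names from the official SAM2 release; adjust to local paths.
predictor = build_sam2_video_predictor(
    "configs/sam2.1/sam2.1_hiera_l.yaml", "checkpoints/sam2.1_hiera_large.pt"
)

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    # SAM2 expects the video as a directory of JPEG frames (hypothetical path here).
    state = predictor.init_state(video_path="surgical_clip_frames/")

    # Prompt object 1 (e.g. an instrument tip) with a single positive click on frame 0.
    predictor.add_new_points_or_box(
        inference_state=state,
        frame_idx=0,
        obj_id=1,
        points=np.array([[320, 240]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),  # 1 = foreground click
    )

    # Propagate the prompt through the clip; each step yields mask logits per tracked object.
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks = (mask_logits > 0.0).cpu().numpy()  # binary masks forming the masklet
```

SAM2S keeps this interaction model but replaces SAM2's fixed recent-frame memory with its trainable DiveMem mechanism and adds surgical-domain training, which is where the reported long-term tracking gains come from.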