

DC-SAM: In-Context Segment Anything in Images and Videos via Dual Consistency

April 16, 2025
作者: Mengshi Qi, Pengfei Zhu, Xiangtai Li, Xiaoyang Bi, Lu Qi, Huadong Ma, Ming-Hsuan Yang
cs.AI

Abstract

Given a single labeled example, in-context segmentation aims to segment corresponding objects. This setting, known as one-shot segmentation in few-shot learning, explores the segmentation model's generalization ability and has been applied to various vision tasks, including scene understanding and image/video editing. While recent Segment Anything Models (SAM) have achieved state-of-the-art results in interactive segmentation, these approaches are not directly applicable to in-context segmentation. In this work, we propose the Dual Consistency SAM (DC-SAM) method, based on prompt-tuning, to adapt SAM and SAM2 for in-context segmentation of both images and videos. Our key insight is to enhance the features of SAM's prompt encoder for segmentation by providing high-quality visual prompts. When generating a mask prior, we fuse the SAM features to better align the prompt encoder. Then, we design a cycle-consistent cross-attention on the fused features and initial visual prompts. Next, we provide a dual-branch design that uses discriminative positive and negative prompts in the prompt encoder. Furthermore, we design a simple mask-tube training strategy to apply the proposed dual-consistency method to mask tubes. Although the proposed DC-SAM is primarily designed for images, it can be seamlessly extended to the video domain with the support of SAM2. Given the absence of in-context segmentation benchmarks in the video domain, we manually curate and construct the first benchmark from existing video segmentation datasets, named In-Context Video Object Segmentation (IC-VOS), to better assess the in-context capability of the model. Extensive experiments demonstrate that our method achieves 55.5 (+1.4) mIoU on COCO-20i, 73.0 (+1.1) mIoU on PASCAL-5i, and a J&F score of 71.52 on the proposed IC-VOS benchmark. Our source code and benchmark are available at https://github.com/zaplm/DC-SAM.
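The distinctive mechanism named in the abstract is the cycle-consistency check between the fused SAM features and the initial visual prompts. Below is a minimal, illustrative PyTorch sketch of what one such cycle-consistent cross-attention step could look like; it is not the authors' implementation, and every name in it (cycle_consistent_cross_attention, prompts, features, temperature) is an assumption made for illustration. The actual code is in the linked repository.

```python
# Minimal sketch (not the authors' code) of a cycle-consistent
# cross-attention step between fused image features and visual prompts.
# All module and variable names are illustrative assumptions.
import torch
import torch.nn.functional as F


def cycle_consistent_cross_attention(prompts, features, temperature=1.0):
    """prompts:  (P, C) initial visual prompt tokens.
       features: (N, C) fused SAM image features (flattened spatial grid).
       Returns prompt tokens refined only where forward and backward
       attention assignments agree (the "cycle" check)."""
    # Forward attention: each prompt attends over the feature grid.
    fwd = F.softmax(prompts @ features.T / temperature, dim=-1)  # (P, N)
    # Backward attention: each feature location attends over the prompts.
    bwd = F.softmax(features @ prompts.T / temperature, dim=-1)  # (N, P)

    # Cycle check: prompt p is considered consistent only if its strongest
    # feature location attends back most strongly to p itself.
    fwd_idx = fwd.argmax(dim=-1)  # (P,) best feature per prompt
    bwd_idx = bwd.argmax(dim=-1)  # (N,) best prompt per feature
    consistent = bwd_idx[fwd_idx] == torch.arange(prompts.size(0))

    # Aggregate features into refined prompts; keep the original token
    # where the cycle check fails.
    refined = fwd @ features  # (P, C)
    return torch.where(consistent.unsqueeze(-1), refined, prompts)
```

In this sketch a prompt token is only updated when its strongest feature match attends back to that same prompt, which filters out noisy correspondences. Per the abstract, DC-SAM further splits the prompt encoder into positive and negative branches, so a step like this would run for each branch's prompts.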

