DC-SAM: In-Context Segment Anything in Images and Videos via Dual Consistency

Abstract

Given a single labeled example, in-context segmentation aims to segment the corresponding objects. This setting, known as one-shot segmentation in few-shot learning, probes a segmentation model's generalization ability and has been applied to various vision tasks, including scene understanding and image/video editing. Although the recent Segment Anything Models (SAM) achieve state-of-the-art results in interactive segmentation, they are not directly applicable to in-context segmentation. In this work, we propose Dual Consistency SAM (DC-SAM), a prompt-tuning method that adapts SAM and SAM2 for in-context segmentation of both images and videos. Our key insight is to enhance the features of SAM's prompt encoder for segmentation by providing high-quality visual prompts. When generating a mask prior, we fuse the SAM features to better align the prompt encoder. We then design a cycle-consistent cross-attention over the fused features and the initial visual prompts. Next, we provide a dual-branch design that uses discriminative positive and negative prompts in the prompt encoder. Furthermore, we design a simple mask-tube training strategy that applies the proposed dual consistency method to mask tubes. Although DC-SAM is primarily designed for images, it extends seamlessly to the video domain with the support of SAM2. Given the absence of in-context segmentation in the video domain, we manually curate and construct the first benchmark from existing video segmentation datasets, named In-Context Video Object Segmentation (IC-VOS), to better assess the in-context capability of models. Extensive experiments demonstrate that our method achieves 55.5 (+1.4) mIoU on COCO-20i, 73.0 (+1.1) mIoU on PASCAL-5i, and a J&F score of 71.52 on the proposed IC-VOS benchmark. Our source code and benchmark are available at https://github.com/zaplm/DC-SAM.
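
For readers unfamiliar with the mIoU figures quoted above, the following is a minimal sketch of how a class-averaged intersection-over-union can be computed from binary masks. It is not taken from the DC-SAM code base: the function names, the episode tuple layout, and the choice to accumulate intersection and union per class before averaging are assumptions made for illustration, and the paper's exact evaluation protocol may differ.

```python
from typing import Dict, Iterable, Sequence, Tuple

# A binary mask is a sequence of rows holding 0/1 values (illustrative type alias).
Mask = Sequence[Sequence[int]]


def intersection_and_union(pred: Mask, target: Mask) -> Tuple[int, int]:
    """Count the pixels in the intersection and in the union of two binary masks."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    return inter, union


def mean_iou(episodes: Iterable[Tuple[int, Mask, Mask]]) -> float:
    """Average per-class IoU over (class_id, predicted_mask, target_mask) episodes.

    Assumption: intersection and union are summed per class across episodes,
    and the resulting per-class IoUs are then averaged with equal weight.
    """
    inter_by_class: Dict[int, int] = {}
    union_by_class: Dict[int, int] = {}
    for class_id, pred, target in episodes:
        inter, union = intersection_and_union(pred, target)
        inter_by_class[class_id] = inter_by_class.get(class_id, 0) + inter
        union_by_class[class_id] = union_by_class.get(class_id, 0) + union

    # Classes whose union is empty are skipped to avoid division by zero.
    ious = [
        inter_by_class[c] / union_by_class[c]
        for c in inter_by_class
        if union_by_class[c] > 0
    ]
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: class 0 scores IoU 1/2, class 1 scores IoU 4/4.
toy_episodes = [
    (0, [[1, 1], [0, 0]], [[1, 0], [0, 0]]),
    (1, [[1, 1], [1, 1]], [[1, 1], [1, 1]]),
]
toy_miou = mean_iou(toy_episodes)
```

In this toy example the two per-class IoUs are 0.5 and 1.0, so `toy_miou` is 0.75, which would be reported as 75.0 on the percentage scale used for the COCO-20i and PASCAL-5i numbers.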
