X2SAM: Any Segmentation in Images and Videos

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated strong image-level visual understanding and reasoning, yet their pixel-level perception across both images and videos remains limited. Foundation segmentation models such as the SAM series produce high-quality masks, but they rely on low-level visual prompts and cannot natively interpret complex conversational instructions. Existing segmentation MLLMs narrow this gap, but are usually specialized for either images or videos and rarely support both textual and visual prompts in one interface. We introduce X2SAM, a unified segmentation MLLM that extends any-segmentation capabilities from images to videos. Given conversational instructions and visual prompts, X2SAM couples an LLM with a Mask Memory module that stores guided vision features for temporally consistent video mask generation. The same formulation supports generic, open-vocabulary, referring, reasoning, grounded conversation generation, interactive, and visual grounded segmentation across image and video inputs. We further introduce the Video Visual Grounded (V-VGD) segmentation benchmark, which evaluates whether a model can segment object tracks in videos from interactive visual prompts. With a unified joint training strategy over heterogeneous image and video datasets, X2SAM delivers strong video segmentation performance, remains competitive on image segmentation benchmarks, and preserves general image and video chat ability.
