X2SAM: Any Segmentation in Images and Videos

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated strong image-level visual understanding and reasoning, yet their pixel-level perception across both images and videos remains limited. Foundation segmentation models such as the SAM series produce high-quality masks, but they rely on low-level visual prompts and cannot natively interpret complex conversational instructions. Existing segmentation MLLMs narrow this gap, but are usually specialized for either images or videos and rarely support both textual and visual prompts in one interface. We introduce X2SAM, a unified segmentation MLLM that extends any-segmentation capabilities from images to videos. Given conversational instructions and visual prompts, X2SAM couples an LLM with a Mask Memory module that stores guided vision features for temporally consistent video mask generation. The same formulation supports generic, open-vocabulary, referring, reasoning, grounded conversation generation, interactive, and visual grounded segmentation across image and video inputs. We further introduce the Video Visual Grounded (V-VGD) segmentation benchmark, which evaluates whether a model can segment object tracks in videos from interactive visual prompts. With a unified joint training strategy over heterogeneous image and video datasets, X2SAM delivers strong video segmentation performance, remains competitive on image segmentation benchmarks, and preserves general image and video chat ability.
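
To make the memory idea above more concrete, here is a minimal sketch of how a bounded memory of per-frame features could condition later frames. It is an illustration only, not the X2SAM Mask Memory module: the abstract does not specify its design, so the class `MaskMemory`, the FIFO capacity, the softmax-attention read, and the helper `run_with_memory` are all assumptions, and plain Python lists stand in for real vision features.

```python
from collections import deque
import math


class MaskMemory:
    """Hypothetical FIFO memory of per-frame features (not the paper's module)."""

    def __init__(self, capacity=4):
        # A bounded deque drops the oldest frame once capacity is reached.
        self.frames = deque(maxlen=capacity)

    def write(self, feature):
        self.frames.append(list(feature))

    def read(self, query):
        """Fuse the current-frame feature with stored ones via softmax attention."""
        if not self.frames:
            return list(query)  # first frame: nothing to condition on
        scale = math.sqrt(len(query))
        scores = [sum(q * k for q, k in zip(query, mem)) / scale
                  for mem in self.frames]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        context = [
            sum(w * mem[i] for w, mem in zip(weights, self.frames)) / total
            for i in range(len(query))
        ]
        # Residual fusion: current feature plus the attended memory context.
        return [q + c for q, c in zip(query, context)]


def run_with_memory(frame_features, memory):
    """Process frames in order: read from memory, then store the raw feature."""
    fused_per_frame = []
    for feature in frame_features:
        fused_per_frame.append(memory.read(feature))
        memory.write(feature)
    return fused_per_frame


if __name__ == "__main__":
    memory = MaskMemory(capacity=2)
    toy_frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy 2-D features
    for fused in run_with_memory(toy_frames, memory):
        print(fused)
```

In this sketch each frame after the first is fused with a weighted summary of earlier frames, which is one common way to encourage temporally consistent predictions. A real system would store dense feature maps and decode masks from the fused result.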