X2SAM: Any Segmentation in Images and Videos

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated strong image-level visual understanding and reasoning, yet their pixel-level perception across both images and videos remains limited. Foundation segmentation models such as the SAM series produce high-quality masks, but they rely on low-level visual prompts and cannot natively interpret complex conversational instructions. Existing segmentation MLLMs narrow this gap, but are usually specialized for either images or videos and rarely support both textual and visual prompts in one interface. We introduce X2SAM, a unified segmentation MLLM that extends any-segmentation capabilities from images to videos. Given conversational instructions and visual prompts, X2SAM couples an LLM with a Mask Memory module that stores guided vision features for temporally consistent video mask generation. The same formulation supports generic, open-vocabulary, referring, reasoning, grounded conversation generation, interactive, and visual grounded segmentation across image and video inputs. We further introduce the Video Visual Grounded (V-VGD) segmentation benchmark, which evaluates whether a model can segment object tracks in videos from interactive visual prompts. With a unified joint training strategy over heterogeneous image and video datasets, X2SAM delivers strong video segmentation performance, remains competitive on image segmentation benchmarks, and preserves general image and video chat ability.
