

One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos

September 29, 2024
作者: Zechen Bai, Tong He, Haiyang Mei, Pichao Wang, Ziteng Gao, Joya Chen, Lei Liu, Zheng Zhang, Mike Zheng Shou
cs.AI

Abstract

We introduce VideoLISA, a video-based multimodal large language model designed to tackle the problem of language-instructed reasoning segmentation in videos. Leveraging the reasoning capabilities and world knowledge of large language models, and augmented by the Segment Anything Model, VideoLISA generates temporally consistent segmentation masks in videos based on language instructions. Existing image-based methods, such as LISA, struggle with video tasks due to the additional temporal dimension, which requires temporal dynamic understanding and consistent segmentation across frames. VideoLISA addresses these challenges by integrating a Sparse Dense Sampling strategy into the video-LLM, which balances temporal context and spatial detail within computational constraints. Additionally, we propose a One-Token-Seg-All approach using a specially designed <TRK> token, enabling the model to segment and track objects across multiple frames. Extensive evaluations on diverse benchmarks, including our newly introduced ReasonVOS benchmark, demonstrate VideoLISA's superior performance in video object segmentation tasks involving complex reasoning, temporal understanding, and object tracking. While optimized for videos, VideoLISA also shows promising generalization to image segmentation, revealing its potential as a unified foundation model for language-instructed object segmentation. Code and model will be available at: https://github.com/showlab/VideoLISA.
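The "One-Token-Seg-All" idea described above can be sketched in a few lines: the LLM emits a single special <TRK> token, and that token's hidden state is reused as the mask prompt for every frame, so the whole video is segmented with one shared object representation. The sketch below is purely illustrative; all names (`segment_frame`, `one_token_seg_all`, `trk_embedding`) are hypothetical and not the actual VideoLISA API, and the per-frame "decoder" is a toy dot-product stand-in for the Segment Anything mask decoder.

```python
# Toy sketch of One-Token-Seg-All, assuming per-pixel feature vectors.
# The key point: the SAME <TRK> embedding prompts the decoder on every
# frame, which is what makes the resulting masks temporally consistent.

def segment_frame(frame, trk_embedding, threshold=0.5):
    """Toy per-frame 'mask decoder': mark pixels whose feature vector
    correlates with the shared <TRK> embedding above a threshold."""
    mask = []
    for row in frame:
        mask.append([
            1 if sum(p * t for p, t in zip(px, trk_embedding)) > threshold else 0
            for px in row
        ])
    return mask

def one_token_seg_all(video, trk_embedding):
    """Segment every frame with one shared object embedding, so the
    object is tracked implicitly across the temporal dimension."""
    return [segment_frame(frame, trk_embedding) for frame in video]
```

For example, with 2-D pixel features and the (hypothetical) embedding `[1, 0]`, every pixel aligned with that direction is marked `1` in every frame, while orthogonal pixels stay `0`.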