MeViS: A Multi-Modal Dataset for Referring Motion Expression Video Segmentation
December 11, 2025
Authors: Henghui Ding, Chang Liu, Shuting He, Kaining Ying, Xudong Jiang, Chen Change Loy, Yu-Gang Jiang
cs.AI
Abstract
This paper presents a large-scale multi-modal dataset for referring motion expression video segmentation, which focuses on segmenting and tracking target objects in videos based on language descriptions of their motion. Existing referring video segmentation datasets often focus on salient objects and use language expressions rich in static attributes, which can allow the target object to be identified from a single frame. Such datasets underemphasize the role of motion in both video and language. To explore the feasibility of using motion expressions and motion reasoning cues for pixel-level video understanding, we introduce MeViS, a dataset containing 33,072 human-annotated motion expressions in both text and audio, covering 8,171 objects in 2,006 videos of complex scenarios. We benchmark 15 existing methods across the 4 tasks supported by MeViS: 6 referring video object segmentation (RVOS) methods, 3 audio-guided video object segmentation (AVOS) methods, 2 referring multi-object tracking (RMOT) methods, and 4 video captioning methods for the newly introduced referring motion expression generation (RMEG) task. The results reveal the weaknesses and limitations of existing methods in motion expression-guided video understanding. We further analyze the challenges and propose an approach, LMPM++, for RVOS, AVOS, and RMOT that achieves new state-of-the-art results. Our dataset provides a platform that facilitates the development of motion expression-guided video understanding algorithms for complex video scenes. The MeViS dataset and the method's source code are publicly available at https://henghuiding.com/MeViS/