

MeViS: A Multi-Modal Dataset for Referring Motion Expression Video Segmentation

December 11, 2025
Authors: Henghui Ding, Chang Liu, Shuting He, Kaining Ying, Xudong Jiang, Chen Change Loy, Yu-Gang Jiang
cs.AI

Abstract

This paper proposes a large-scale multi-modal dataset for referring motion expression video segmentation, focusing on segmenting and tracking target objects in videos based on language descriptions of the objects' motions. Existing referring video segmentation datasets often focus on salient objects and use language expressions rich in static attributes, potentially allowing the target object to be identified in a single frame. Such datasets underemphasize the role of motion in both videos and language. To explore the feasibility of using motion expressions and motion reasoning clues for pixel-level video understanding, we introduce MeViS, a dataset containing 33,072 human-annotated motion expressions in both text and audio, covering 8,171 objects in 2,006 videos of complex scenarios. We benchmark 15 existing methods across 4 tasks supported by MeViS, including 6 referring video object segmentation (RVOS) methods, 3 audio-guided video object segmentation (AVOS) methods, 2 referring multi-object tracking (RMOT) methods, and 4 video captioning methods for the newly introduced referring motion expression generation (RMEG) task. The results demonstrate the weaknesses and limitations of existing methods in addressing motion expression-guided video understanding. We further analyze the challenges and propose an approach, LMPM++, for RVOS/AVOS/RMOT that achieves new state-of-the-art results. Our dataset provides a platform that facilitates the development of motion expression-guided video understanding algorithms in complex video scenes. The proposed MeViS dataset and the method's source code are publicly available at https://henghuiding.com/MeViS/.
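To make the benchmark setup concrete, the sketch below illustrates how a referring motion expression segmentation sample and a per-clip region-similarity score (the J term of the commonly reported J&F metric) could be represented. This is a minimal sketch, not the authors' code or the official evaluation toolkit: the class and function names and the example expression are hypothetical, and a real benchmark would additionally average the boundary F-measure.

```python
import numpy as np
from dataclasses import dataclass
from typing import List


@dataclass
class MotionExpressionSample:
    """Hypothetical MeViS-style sample: a video clip paired with a
    motion expression referring to one or more target objects."""
    frames: List[np.ndarray]        # T frames, each H x W x 3, uint8
    expression: str                 # e.g. "the bird flying away from the flock" (illustrative)
    target_masks: List[np.ndarray]  # T ground-truth masks, H x W, bool (union of referred objects)


def region_similarity(pred: np.ndarray, gt: np.ndarray) -> float:
    """Jaccard index (region similarity J) between one predicted and one
    ground-truth mask."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else float(inter) / float(union)


def evaluate_clip(pred_masks: List[np.ndarray],
                  sample: MotionExpressionSample) -> float:
    """Average per-frame J over a clip; the official metric also averages
    the contour accuracy F and reports (J + F) / 2."""
    scores = [region_similarity(p, g)
              for p, g in zip(pred_masks, sample.target_masks)]
    return float(np.mean(scores))
```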