MotionSight: Boosting Fine-Grained Motion Understanding in Multimodal LLMs

Abstract

Despite advances in Multimodal Large Language Models (MLLMs), their proficiency in fine-grained video motion understanding remains critically limited. They often lack inter-frame differencing and tend to average out or ignore subtle visual cues. Furthermore, while visual prompting has shown potential on static images, its application to the temporal complexity of video, particularly for fine-grained motion understanding, remains largely unexplored. We investigate whether the inherent capabilities of MLLMs can be unlocked to boost their motion perception and to produce distinct visual signatures that decouple object and camera motion cues. In this study, we introduce MotionSight, a novel zero-shot method that pioneers the use of object-centric visual spotlight and motion blur as visual prompts, improving fine-grained motion understanding without any training. To convert this into a valuable data asset, we curated MotionVid-QA, the first large-scale dataset for fine-grained video motion understanding, with hierarchical annotations including SFT and preference data, approximately 40K video clips, and approximately 87K QA pairs. Experiments show that MotionSight achieves state-of-the-art performance among open-source models and is competitive with commercial models. In summary, for fine-grained motion understanding we present a novel zero-shot technique and a large-scale, high-quality dataset. All code and annotations will be made publicly available.
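
The abstract names two visual prompts, an object-centric spotlight and motion blur, but gives no implementation details. The following is only a minimal sketch of what such prompts could look like, not the authors' method. It assumes grayscale frames stored as plain Python lists, a hand-supplied bounding box, and two hypothetical helper names, `spotlight` and `motion_blur`.

```python
from typing import List, Sequence, Tuple

# Assumption: a frame is a grayscale image given as rows of intensities in [0, 1].
Frame = List[List[float]]
# Assumption: a box is (top, left, bottom, right), with bottom/right exclusive.
Box = Tuple[int, int, int, int]


def spotlight(frame: Frame, box: Box, dim: float = 0.25) -> Frame:
    """Keep pixels inside `box` unchanged and darken all others by factor `dim`."""
    top, left, bottom, right = box
    return [
        [
            pixel if (top <= r < bottom and left <= c < right) else pixel * dim
            for c, pixel in enumerate(row)
        ]
        for r, row in enumerate(frame)
    ]


def motion_blur(frames: Sequence[Frame]) -> Frame:
    """Average the given frames pixel-wise so that movement appears as a smear."""
    if not frames:
        raise ValueError("need at least one frame")
    n = len(frames)
    height, width = len(frames[0]), len(frames[0][0])
    return [
        [sum(f[r][c] for f in frames) / n for c in range(width)]
        for r in range(height)
    ]


# Toy clip: a bright dot moving one pixel to the right on a 1x4 strip.
clip = [
    [[1.0, 0.0, 0.0, 0.0]],
    [[0.0, 1.0, 0.0, 0.0]],
]
blurred = motion_blur(clip)                  # dot's path smeared over two pixels
focused = spotlight(clip[0], (0, 0, 1, 2))   # columns 0-1 kept, the rest dimmed
```

In this toy setting the processed frames, rather than the raw ones, would be passed to the MLLM together with the question. How the object region is chosen and how many frames are blended are left open here, since the abstract does not specify them.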
