

MMR-V: What's Left Unsaid? A Benchmark for Multimodal Deep Reasoning in Videos

June 4, 2025
Authors: Kejian Zhu, Zhuoran Jin, Hongbang Yuan, Jiachun Li, Shangqing Tu, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao
cs.AI

Abstract

The sequential structure of videos poses a challenge to the ability of multimodal large language models (MLLMs) to locate multi-frame evidence and conduct multimodal reasoning. However, existing video benchmarks mainly focus on understanding tasks, which only require models to match frames mentioned in the question (hereafter referred to as "question frames") and perceive a few adjacent frames. To address this gap, we propose MMR-V: A Benchmark for Multimodal Deep Reasoning in Videos. The benchmark is characterized by the following features. (1) Long-range, multi-frame reasoning: models must infer and analyze evidence frames that may be far from the question frame. (2) Beyond perception: questions cannot be answered through direct perception alone but require reasoning over hidden information. (3) Reliability: all tasks are manually annotated, with reference to extensive real-world user interpretations, to align with common perceptions. (4) Confusability: distractor annotation strategies are carefully designed to reduce model shortcuts. MMR-V consists of 317 videos and 1,257 tasks. Our experiments reveal that current models still struggle with multimodal reasoning; even the best-performing model, o4-mini, achieves only 52.5% accuracy. Additionally, current reasoning-enhancement strategies (Chain-of-Thought and scaling test-time compute) bring limited gains. Further analysis indicates that the Chain-of-Thought (CoT) demanded by multimodal reasoning differs from that in textual reasoning, which partly explains the limited performance gains. We hope that MMR-V can inspire further research into enhancing multimodal reasoning capabilities.
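For context, the reported 52.5% figure is plain accuracy over the 1,257 multiple-choice tasks. Below is a minimal sketch of what such a scoring harness might look like; the JSON layout, field names (`task_id`, `answer`), and file name are illustrative assumptions, not the benchmark's released format.

```python
# Minimal sketch of scoring a model on MMR-V-style multiple-choice tasks.
# The schema here (task_id / answer fields, one JSON file) is assumed for
# illustration; consult the benchmark's release for the actual format.
import json


def load_tasks(path: str) -> list[dict]:
    """Load tasks, each assumed to carry a task_id and a gold answer letter."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def accuracy(predictions: dict[str, str], tasks: list[dict]) -> float:
    """Fraction of tasks whose predicted option letter matches the gold answer."""
    correct = sum(
        1 for task in tasks
        if predictions.get(task["task_id"]) == task["answer"]
    )
    return correct / len(tasks)


if __name__ == "__main__":
    tasks = load_tasks("mmr_v_tasks.json")  # hypothetical file name
    # A real run would query an MLLM with the video frames and the question;
    # a constant guess is stubbed in here just to exercise the scoring path.
    predictions = {task["task_id"]: "A" for task in tasks}
    print(f"Accuracy: {accuracy(predictions, tasks):.1%} over {len(tasks)} tasks")
```

In a real evaluation, `predictions` would come from prompting the model with sampled video frames plus the question and parsing out its chosen option letter.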