

MMR-V: What's Left Unsaid? A Benchmark for Multimodal Deep Reasoning in Videos

June 4, 2025
Authors: Kejian Zhu, Zhuoran Jin, Hongbang Yuan, Jiachun Li, Shangqing Tu, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao
cs.AI

Abstract

The sequential structure of videos poses a challenge to the ability of multimodal large language models (MLLMs) to locate multi-frame evidence and conduct multimodal reasoning. However, existing video benchmarks mainly focus on understanding tasks, which only require models to match the frames mentioned in the question (hereafter referred to as "question frames") and perceive a few adjacent frames. To address this gap, we propose MMR-V: A Benchmark for Multimodal Deep Reasoning in Videos. The benchmark is characterized by the following features. (1) Long-range, multi-frame reasoning: Models are required to infer and analyze evidence frames that may be far from the question frame. (2) Beyond perception: Questions cannot be answered through direct perception alone but require reasoning over hidden information. (3) Reliability: All tasks are manually annotated, referencing extensive real-world user understanding to align with common perceptions. (4) Confusability: Distractor annotation strategies are carefully designed to reduce model shortcuts. MMR-V consists of 317 videos and 1,257 tasks. Our experiments reveal that current models still struggle with multimodal reasoning; even the best-performing model, o4-mini, achieves only 52.5% accuracy. Additionally, current reasoning enhancement strategies (Chain-of-Thought and scaling test-time compute) bring limited gains. Further analysis indicates that the chain-of-thought required for multimodal reasoning differs from that used in textual reasoning, which partly explains the limited performance gains. We hope that MMR-V can inspire further research into enhancing multimodal reasoning capabilities.
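To make the reported metric concrete, the following is a minimal sketch of how accuracy might be computed over MMR-V-style multiple-choice video tasks. The task schema (fields `video`, `question`, `options`, `answer`) and the model interface `answer_video_question` are assumptions for illustration, not the authors' released format or API.

```python
from typing import Callable, Dict, List

def evaluate(
    tasks: List[Dict],
    answer_video_question: Callable[[str, str, List[str]], str],
) -> float:
    """Return overall accuracy over multiple-choice video reasoning tasks.

    Each task is assumed to hold a video path, a question, candidate options,
    and the ground-truth option label.
    """
    if not tasks:
        return 0.0
    correct = 0
    for task in tasks:
        # The model receives the full video plus the question and options, so it
        # can (in principle) locate evidence frames far from the question frame.
        prediction = answer_video_question(
            task["video"], task["question"], task["options"]
        )
        if prediction == task["answer"]:
            correct += 1
    return correct / len(tasks)

# On MMR-V's 1,257 tasks, the paper reports 52.5% accuracy for the
# best-performing model (o4-mini) under this kind of exact-match scoring.
```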