MMR-V: What's Left Unsaid? A Benchmark for Multimodal Deep Reasoning in Videos

Abstract

The sequential structure of videos challenges the ability of multimodal large language models (MLLMs) to locate multi-frame evidence and conduct multimodal reasoning. However, existing video benchmarks mainly focus on understanding tasks, which only require models to match the frames mentioned in the question (hereafter the "question frame") and perceive a few adjacent frames. To address this gap, we propose MMR-V: A Benchmark for Multimodal Deep Reasoning in Videos. The benchmark has the following features. (1) Long-range, multi-frame reasoning: Models are required to infer and analyze evidence frames that may be far from the question frame. (2) Beyond perception: Questions cannot be answered through direct perception alone but require reasoning over hidden information. (3) Reliability: All tasks are manually annotated, referencing extensive real-world user understanding to align with common perceptions. (4) Confusability: Distractors are annotated with carefully designed strategies to reduce model shortcuts. MMR-V consists of 317 videos and 1,257 tasks. Our experiments reveal that current models still struggle with multimodal reasoning; even the best-performing model, o4-mini, achieves only 52.5% accuracy. Additionally, current reasoning enhancement strategies (Chain-of-Thought and scaling test-time compute) bring limited gains. Further analysis indicates that the CoT required for multimodal reasoning differs from that required for textual reasoning, which partly explains the limited performance gains. We hope that MMR-V can inspire further research into enhancing multimodal reasoning capabilities.

