MMR-V: What's Left Unsaid? A Benchmark for Multimodal Deep Reasoning in Videos

Abstract

The sequential structure of videos challenges the ability of multimodal large language models (MLLMs) to locate multi-frame evidence and conduct multimodal reasoning. However, existing video benchmarks mainly focus on understanding tasks, which only require models to match the frames mentioned in the question (hereafter the "question frame") and perceive a few adjacent frames. To address this gap, we propose MMR-V: A Benchmark for Multimodal Deep Reasoning in Videos. The benchmark has the following features. (1) Long-range, multi-frame reasoning: models must infer and analyze evidence frames that may be far from the question frame. (2) Beyond perception: questions cannot be answered through direct perception alone and require reasoning over hidden information. (3) Reliability: all tasks are manually annotated, with reference to extensive real-world user understanding so that they align with common perception. (4) Confusability: distractors are annotated with a carefully designed strategy to reduce model shortcuts. MMR-V consists of 317 videos and 1,257 tasks. Our experiments reveal that current models still struggle with multimodal reasoning; even the best-performing model, o4-mini, achieves only 52.5% accuracy. Additionally, current reasoning enhancement strategies (Chain-of-Thought and scaling test-time compute) bring limited gains. Further analysis indicates that the CoT demanded for multimodal reasoning differs from that in textual reasoning, which partly explains the limited performance gains. We hope that MMR-V can inspire further research into enhancing multimodal reasoning capabilities.

