MMOU: A Massive Multi-Task Omni Understanding and Reasoning Benchmark for Long and Complex Real-World Videos
March 14, 2026
作者: Arushi Goel, Sreyan Ghosh, Vatsal Agarwal, Nishit Anand, Kaousheik Jayakumar, Lasha Koroshinadze, Yao Xu, Katie Lyons, James Case, Karan Sapra, Kevin J. Shih, Siddharth Gururani, Abhinav Shrivastava, Ramani Duraiswami, Dinesh Manocha, Andrew Tao, Bryan Catanzaro, Mohammad Shoeybi, Wei Ping
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) have shown strong performance on visual and audio understanding tasks when evaluated in isolation. However, their ability to jointly reason over omni-modal (visual, audio, and textual) signals in long and complex videos remains largely unexplored. We introduce MMOU, a new benchmark designed to systematically evaluate multimodal understanding and reasoning under these challenging, real-world conditions. MMOU consists of 15,000 carefully curated questions paired with 9,038 web-collected videos of varying lengths, spanning diverse domains and exhibiting rich, tightly coupled audio-visual content. The benchmark covers 13 fundamental skill categories, all of which require integrating evidence across modalities and time. All questions are manually annotated over multiple rounds by professional annotators, ensuring high quality and reasoning fidelity. We evaluate 20+ state-of-the-art open-source and proprietary multimodal models on MMOU. The results expose substantial performance gaps: the best closed-source model achieves only 64.2% accuracy, while the strongest open-source model reaches just 46.8%. These results highlight the challenges of long-form omni-modal understanding, revealing that current models frequently fail to apply even fundamental skills in long videos. Through detailed analysis, we further identify systematic failure modes and provide insights into where and why current models break.