OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs
October 12, 2025
Authors: Caorui Li, Yu Chen, Yiyan Ji, Jin Xu, Zhenyu Cui, Shihao Li, Yuanxing Zhang, Jiafu Tang, Zhenghao Song, Dingling Zhang, Ying He, Haoxiang Liu, Yuxuan Wang, Qiufeng Wang, Zhenhe Wu, Jiehui Luo, Zhiyu Pan, Weihao Xie, Chenchen Zhang, Zhaohui Wang, Jiayi Tian, Yanghai Wang, Zhe Cao, Minxin Dai, Ke Wang, Runzhe Wen, Yinghao Ma, Yaning Pan, Sungkyun Chang, Termeh Taheri, Haiwen Xia, Christos Plachouras, Emmanouil Benetos, Yizhi Li, Ge Zhang, Jian Yang, Tianhao Peng, Zili Wang, Minghao Liu, Junran Peng, Zhaoxiang Zhang, Jiaheng Liu
cs.AI
Abstract
Recent advances in multimodal large language models (MLLMs) have demonstrated
substantial potential in video understanding. However, existing benchmarks fail
to comprehensively evaluate synergistic reasoning capabilities across audio and
visual modalities, often neglecting either one of the modalities or integrating
them in a logically inconsistent manner. To bridge this gap, we introduce
OmniVideoBench, a large-scale and rigorously designed benchmark dedicated to
assessing synergistic audio-visual understanding, with a strong emphasis on
modality complementarity and logical consistency. Specifically, OmniVideoBench
comprises 1,000 high-quality question-answer (QA) pairs, each annotated with
step-by-step reasoning traces, derived from 628 diverse videos ranging from
several seconds to 30 minutes, and manually verified to guarantee complete
correctness and uniqueness. Moreover, OmniVideoBench encompasses 13 carefully
designed question types, covering temporal reasoning, spatial localization,
counting, causal inference, summarization, and beyond, thereby capturing the
essential challenges of video understanding. Evaluation of multiple MLLMs on
OmniVideoBench reveals a pronounced gap between model performance and human
reasoning, with open-source models lagging significantly behind their
closed-source counterparts, underscoring the inherent difficulty of genuine
audio-visual reasoning. We will release OmniVideoBench to foster the
development of MLLMs with stronger and more generalizable reasoning
capabilities.
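The abstract does not specify the released data format, so as a minimal sketch, the following shows one plausible shape for a benchmark entry (a modality-tagged question, its verified answer, and a step-by-step reasoning trace tied to a source video) together with a simple accuracy harness. All field names and the `ask_model` callable are hypothetical, not part of the paper:

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical schema for one OmniVideoBench entry; the paper describes
# QA pairs with step-by-step reasoning traces but does not fix a format.
@dataclass
class BenchEntry:
    video_path: str                 # source clip (a few seconds up to 30 minutes)
    question_type: str              # one of the 13 types, e.g. "temporal reasoning"
    question: str
    options: List[str]              # assumed multiple-choice presentation
    answer: str                     # the single human-verified correct option
    reasoning_trace: List[str] = field(default_factory=list)  # annotated steps

def accuracy(entries: List[BenchEntry],
             ask_model: Callable[[str, str, List[str]], str]) -> float:
    """Score a model callable mapping (video, question, options) -> chosen option."""
    correct = sum(
        ask_model(e.video_path, e.question, e.options) == e.answer
        for e in entries
    )
    return correct / len(entries) if entries else 0.0
```

Under this assumed layout, per-type breakdowns (e.g., counting vs. causal inference) would just group entries by `question_type` before calling `accuracy`.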