Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding
April 6, 2026
Authors: Chaoyou Fu, Haozhi Yuan, Yuhao Dong, Yi-Fan Zhang, Yunhang Shen, Xiaoxing Hu, Xueying Li, Jinsen Su, Chengwu Long, Xiaoyao Xie, Yongkang Xie, Xiawu Zheng, Xue Yang, Haoyu Cao, Yunsheng Wu, Ziwei Liu, Xing Sun, Caifeng Shan, Ran He
cs.AI
Abstract
With the rapid advancement of video understanding, existing benchmarks are becoming increasingly saturated, exposing a critical discrepancy between inflated leaderboard scores and real-world model capabilities. To close this widening gap, we introduce Video-MME-v2, a comprehensive benchmark designed to rigorously evaluate the robustness and faithfulness of video understanding. To assess model capabilities systematically, we design a progressive tri-level hierarchy that incrementally increases the complexity of video comprehension, from multi-point visual information aggregation, through temporal dynamics modeling, to complex multimodal reasoning. In contrast to conventional per-question accuracy, we propose a group-based non-linear evaluation strategy that enforces both consistency across related queries and coherence in multi-step reasoning: it penalizes fragmented or guess-based correctness and assigns credit only to answers supported by valid reasoning. To guarantee data quality, Video-MME-v2 is constructed through a rigorously controlled human annotation pipeline involving 12 annotators and 50 independent reviewers. Backed by 3,300 human-hours and up to 5 rounds of quality assurance, Video-MME-v2 aims to serve as one of the most authoritative video benchmarks. Extensive experiments reveal a substantial gap between the current best model, Gemini-3-Pro, and human experts, and uncover a clear hierarchical bottleneck: errors in visual information aggregation and temporal modeling propagate upward and limit high-level reasoning. We further find that thinking-based reasoning depends heavily on textual cues, improving performance when subtitles are available but sometimes degrading it in purely visual settings. By exposing these limitations, Video-MME-v2 establishes a demanding new testbed for the development of next-generation video MLLMs.
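To make the group-based scoring concrete, the sketch below shows one simple way such a non-linear rule could be realized in Python. The all-or-nothing credit within each group, as well as every name in the snippet (`group_score`, the example group ids), is an illustrative assumption of ours; the abstract does not specify the benchmark's exact scoring function.

```python
from typing import Dict, List

def group_score(results: Dict[str, List[bool]]) -> float:
    """Group-based non-linear scoring (illustrative sketch).

    `results` maps a group id (questions probing the same video
    evidence or reasoning chain) to per-question correctness flags.
    A group earns credit only if every answer in it is correct, so
    isolated or guessed hits contribute nothing -- one simple way
    to realize the consistency requirement described above.
    """
    if not results:
        return 0.0
    credited = sum(1 for flags in results.values() if all(flags))
    return credited / len(results)

# Hypothetical example: three question groups from one video.
groups = {
    "g1_aggregation": [True, True],       # both related queries correct
    "g2_temporal":    [True, False],      # inconsistent -> no credit
    "g3_reasoning":   [True, True, True], # coherent multi-step chain
}
print(f"group score: {group_score(groups):.2f}")  # 0.67
```

Under a rule like this, a model that guesses one question right in `g2_temporal` gains nothing, which captures the intended contrast with conventional per-question accuracy.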