Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding
April 6, 2026
Authors: Chaoyou Fu, Haozhi Yuan, Yuhao Dong, Yi-Fan Zhang, Yunhang Shen, Xiaoxing Hu, Xueying Li, Jinsen Su, Chengwu Long, Xiaoyao Xie, Yongkang Xie, Xiawu Zheng, Xue Yang, Haoyu Cao, Yunsheng Wu, Ziwei Liu, Xing Sun, Caifeng Shan, Ran He
cs.AI
Abstract
With the rapid advancement of video understanding, existing benchmarks are becoming increasingly saturated, exposing a critical discrepancy between inflated leaderboard scores and real-world model capabilities. To close this widening gap, we introduce Video-MME-v2, a comprehensive benchmark designed to rigorously evaluate the robustness and faithfulness of video understanding. To assess model capabilities systematically, we design a progressive tri-level hierarchy that incrementally increases the complexity of video comprehension, from multi-point visual information aggregation, through temporal dynamics modeling, to complex multimodal reasoning. In contrast to conventional per-question accuracy, we propose a group-based non-linear evaluation strategy that enforces both consistency across related queries and coherence in multi-step reasoning: it penalizes fragmented or guess-based correctness and assigns credit only to answers supported by valid reasoning. To guarantee data quality, Video-MME-v2 is constructed through a rigorously controlled human annotation pipeline involving 12 annotators and 50 independent reviewers. Backed by 3,300 human-hours and up to 5 rounds of quality assurance, Video-MME-v2 aims to serve as one of the most authoritative video benchmarks. Extensive experiments reveal a substantial gap between the current best model, Gemini-3-Pro, and human experts, and uncover a clear hierarchical bottleneck: errors in visual information aggregation and temporal modeling propagate upward and limit high-level reasoning. We further find that thinking-based reasoning depends heavily on textual cues, improving performance when subtitles are available but sometimes degrading it in purely visual settings. By exposing these limitations, Video-MME-v2 establishes a demanding new testbed for the development of next-generation video MLLMs.
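To make the group-based scoring concrete, the sketch below shows one simple way such a non-linear rule could be realized in Python. The all-or-nothing credit within each group, as well as every name in the snippet (`group_score`, the example group ids), is an illustrative assumption of ours; the abstract does not specify the benchmark's exact scoring function.

```python
from typing import Dict, List

def group_score(results: Dict[str, List[bool]]) -> float:
    """Group-based non-linear scoring (illustrative sketch).

    `results` maps a group id (questions probing the same video
    evidence or reasoning chain) to per-question correctness flags.
    A group earns credit only if every answer in it is correct, so
    isolated or guessed hits contribute nothing -- one simple way
    to realize the consistency requirement described above.
    """
    if not results:
        return 0.0
    credited = sum(1 for flags in results.values() if all(flags))
    return credited / len(results)

# Hypothetical example: three question groups from one video.
groups = {
    "g1_aggregation": [True, True],       # both related queries correct
    "g2_temporal":    [True, False],      # inconsistent -> no credit
    "g3_reasoning":   [True, True, True], # coherent multi-step chain
}
print(f"group score: {group_score(groups):.2f}")  # 0.67
```

Under a rule like this, a model that guesses one question right in `g2_temporal` gains nothing, which captures the intended contrast with conventional per-question accuracy.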