Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding
April 6, 2026
Authors: Chaoyou Fu, Haozhi Yuan, Yuhao Dong, Yi-Fan Zhang, Yunhang Shen, Xiaoxing Hu, Xueying Li, Jinsen Su, Chengwu Long, Xiaoyao Xie, Yongkang Xie, Xiawu Zheng, Xue Yang, Haoyu Cao, Yunsheng Wu, Ziwei Liu, Xing Sun, Caifeng Shan, Ran He
cs.AI
Abstract
With the rapid advancement of video understanding, existing benchmarks are becoming increasingly saturated, exposing a critical discrepancy between inflated leaderboard scores and real-world model capabilities. To address this widening gap, we introduce Video-MME-v2, a comprehensive benchmark designed to rigorously evaluate the robustness and faithfulness of video understanding. To assess model capabilities systematically, we design a progressive tri-level hierarchy that incrementally increases the complexity of video comprehension, ranging from multi-point visual information aggregation to temporal dynamics modeling and ultimately to complex multimodal reasoning. In contrast to conventional per-question accuracy, we propose a group-based non-linear evaluation strategy that enforces both consistency across related queries and coherence in multi-step reasoning. It penalizes fragmented or guess-based correctness and assigns credit only to answers supported by valid reasoning. To guarantee data quality, Video-MME-v2 is constructed through a rigorously controlled human annotation pipeline involving 12 annotators and 50 independent reviewers. Backed by 3,300 human-hours and up to 5 rounds of quality assurance, Video-MME-v2 aims to serve as one of the most authoritative video benchmarks. Extensive experiments reveal a substantial gap between the current best model, Gemini-3-Pro, and human experts, and uncover a clear hierarchical bottleneck: errors in visual information aggregation and temporal modeling propagate upward and limit high-level reasoning. We further find that thinking-based reasoning is highly dependent on textual cues, improving performance when subtitles are available but sometimes degrading it in purely visual settings. By exposing these limitations, Video-MME-v2 establishes a demanding new testbed for the development of next-generation video MLLMs.
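The abstract does not give the exact scoring formula, but the group-based non-linear strategy can be made concrete. The Python sketch below is a minimal hypothetical instance, assuming all-or-nothing credit per group of related queries: an answer counts only if it is both correct and backed by reasoning a judge deems valid, so the aggregate score is non-linear in per-question accuracy and lucky guesses earn nothing. All names here (`QuestionResult`, `group_score`, `benchmark_score`) are illustrative and not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class QuestionResult:
    correct: bool          # final answer matches the ground truth
    reasoning_valid: bool  # a judge deems the reasoning chain sound

def group_score(group: list[QuestionResult]) -> float:
    """All-or-nothing credit for one group of related queries.

    Credit is awarded only when every answer in the group is both
    correct and supported by valid reasoning, so fragmented or
    guess-based correctness earns nothing. (Hypothetical rule; the
    paper's exact scoring function is not specified in the abstract.)
    """
    return 1.0 if all(r.correct and r.reasoning_valid for r in group) else 0.0

def benchmark_score(groups: list[list[QuestionResult]]) -> float:
    """Mean group score: non-linear in per-question accuracy, since a
    single wrong or unjustified answer zeroes out its whole group."""
    return sum(group_score(g) for g in groups) / len(groups)

# Example: 2/3 per-question accuracy, yet the group scores 0 because
# one answer is a correct guess without valid reasoning.
demo = [QuestionResult(True, True),
        QuestionResult(True, False),
        QuestionResult(False, False)]
print(benchmark_score([demo]))  # 0.0
```

Under linear per-question accuracy the example above would score 0.67; under the group rule it scores 0, which is the sense in which the strategy penalizes answers not supported by a coherent multi-step chain.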