Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding
April 6, 2026
Authors: Chaoyou Fu, Haozhi Yuan, Yuhao Dong, Yi-Fan Zhang, Yunhang Shen, Xiaoxing Hu, Xueying Li, Jinsen Su, Chengwu Long, Xiaoyao Xie, Yongkang Xie, Xiawu Zheng, Xue Yang, Haoyu Cao, Yunsheng Wu, Ziwei Liu, Xing Sun, Caifeng Shan, Ran He
cs.AI
Abstract
With the rapid advancement of video understanding, existing benchmarks are becoming increasingly saturated, exposing a critical discrepancy between inflated leaderboard scores and real-world model capabilities. To address this widening gap, we introduce Video-MME-v2, a comprehensive benchmark designed to rigorously evaluate the robustness and faithfulness of video understanding. To assess model capabilities systematically, we design a progressive tri-level hierarchy that incrementally increases the complexity of video comprehension, ranging from multi-point visual information aggregation to temporal dynamics modeling and ultimately to complex multimodal reasoning. In contrast to conventional per-question accuracy, we propose a group-based non-linear evaluation strategy that enforces both consistency across related queries and coherence in multi-step reasoning. It penalizes fragmented or guess-based correctness and assigns credit only to answers supported by valid reasoning. To guarantee data quality, Video-MME-v2 is constructed through a rigorously controlled human annotation pipeline involving 12 annotators and 50 independent reviewers. Backed by 3,300 human-hours and up to 5 rounds of quality assurance, Video-MME-v2 aims to serve as one of the most authoritative video benchmarks. Extensive experiments reveal a substantial gap between the current best model, Gemini-3-Pro, and human experts, and uncover a clear hierarchical bottleneck: errors in visual information aggregation and temporal modeling propagate upward and limit high-level reasoning. We further find that thinking-based reasoning is highly dependent on textual cues, improving performance when subtitles are available but sometimes degrading it in purely visual settings. By exposing these limitations, Video-MME-v2 establishes a demanding new testbed for the development of next-generation video MLLMs.
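The abstract does not give the exact scoring formula, but the group-based non-linear strategy can be made concrete. The Python sketch below is a minimal hypothetical instance, assuming all-or-nothing credit per group of related queries: an answer counts only if it is both correct and backed by reasoning a judge deems valid, so the aggregate score is non-linear in per-question accuracy and lucky guesses earn nothing. All names here (`QuestionResult`, `group_score`, `benchmark_score`) are illustrative and not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class QuestionResult:
    correct: bool          # final answer matches the ground truth
    reasoning_valid: bool  # a judge deems the reasoning chain sound

def group_score(group: list[QuestionResult]) -> float:
    """All-or-nothing credit for one group of related queries.

    Credit is awarded only when every answer in the group is both
    correct and supported by valid reasoning, so fragmented or
    guess-based correctness earns nothing. (Hypothetical rule; the
    paper's exact scoring function is not specified in the abstract.)
    """
    return 1.0 if all(r.correct and r.reasoning_valid for r in group) else 0.0

def benchmark_score(groups: list[list[QuestionResult]]) -> float:
    """Mean group score: non-linear in per-question accuracy, since a
    single wrong or unjustified answer zeroes out its whole group."""
    return sum(group_score(g) for g in groups) / len(groups)

# Example: 2/3 per-question accuracy, yet the group scores 0 because
# one answer is a correct guess without valid reasoning.
demo = [QuestionResult(True, True),
        QuestionResult(True, False),
        QuestionResult(False, False)]
print(benchmark_score([demo]))  # 0.0
```

Under linear per-question accuracy the example above would score 0.67; under the group rule it scores 0, which is the sense in which the strategy penalizes answers not supported by a coherent multi-step chain.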