Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding

Abstract

With the rapid advancement of video understanding, existing benchmarks are becoming increasingly saturated, exposing a critical discrepancy between inflated leaderboard scores and real-world model capabilities. To address this widening gap, we introduce Video-MME-v2, a comprehensive benchmark designed to rigorously evaluate the robustness and faithfulness of video understanding. To evaluate model capabilities systematically, we design a progressive tri-level hierarchy that incrementally increases the complexity of video comprehension, ranging from multi-point visual information aggregation, to temporal dynamics modeling, and ultimately to complex multimodal reasoning. In addition, in contrast to conventional per-question accuracy, we propose a group-based non-linear evaluation strategy that enforces both consistency across related queries and coherence in multi-step reasoning. It penalizes fragmented or guess-based correctness and assigns credit only to answers supported by valid reasoning. To guarantee data quality, Video-MME-v2 is constructed through a rigorously controlled human annotation pipeline involving 12 annotators and 50 independent reviewers. Backed by 3,300 human-hours and up to 5 rounds of quality assurance, Video-MME-v2 aims to serve as one of the most authoritative video benchmarks. Extensive experiments reveal a substantial gap between the current best model, Gemini-3-Pro, and human experts, and uncover a clear hierarchical bottleneck in which errors in visual information aggregation and temporal modeling propagate to limit high-level reasoning. We further find that thinking-based reasoning is highly dependent on textual cues, improving performance with subtitles but sometimes degrading it in purely visual settings. By exposing these limitations, Video-MME-v2 establishes a demanding new testbed for the development of next-generation video MLLMs.
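
The abstract does not give the exact scoring formula, so the following is only an illustrative sketch of how a group-based non-linear score can differ from per-question accuracy. The power-law rule, the `exponent` parameter, and the function and group names are all assumptions made for this example, not the benchmark's actual definition.

```python
from collections import defaultdict


def per_question_accuracy(results):
    """Conventional metric: fraction of questions answered correctly."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(correct for _, correct in results) / len(results)


def group_score(results, exponent=2.0):
    """Hypothetical group-based non-linear score (an assumption, not the paper's formula).

    Each group earns (fraction correct in the group) ** exponent, so a
    partially correct group receives less than its linear share of credit.
    The final score is the mean over groups.
    """
    if not results:
        raise ValueError("results must not be empty")
    groups = defaultdict(list)
    for group_id, correct in results:
        groups[group_id].append(correct)
    scores = [(sum(flags) / len(flags)) ** exponent for flags in groups.values()]
    return sum(scores) / len(scores)


# Two groups of four related questions each (made-up data for illustration).
results = [
    ("g1", True), ("g1", True), ("g1", True), ("g1", True),
    ("g2", True), ("g2", False), ("g2", True), ("g2", False),
]

print(per_question_accuracy(results))  # 0.75
print(group_score(results))            # 0.625
```

In this toy case the inconsistent group `g2` contributes 0.25 instead of 0.5, which is the kind of penalty on fragmented correctness that the abstract describes. A stricter variant would give a group credit only when every question in it is answered correctly.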
