VideoZeroBench: Probing the Limits of Video MLLMs with Spatio-Temporal Evidence Verification

Abstract

Recent video multimodal large language models (MLLMs) achieve impressive results across a range of benchmarks. However, current evaluations suffer from two critical limitations: (1) inflated scores can mask deficiencies in fine-grained visual understanding and reasoning, and (2) answer correctness is often measured without verifying whether a model has identified the precise spatio-temporal evidence that supports its prediction. To address this, we present VideoZeroBench, a hierarchical benchmark for challenging long-video question answering that rigorously verifies spatio-temporal evidence. It comprises 500 manually annotated questions across 13 domains, each paired with temporal intervals and spatial bounding boxes as evidence. To disentangle answer generation, temporal grounding, and spatial grounding, we introduce a five-level evaluation protocol that progressively tightens the evidence requirements. Experiments show that even Gemini-3-Pro correctly answers fewer than 17% of questions under the standard end-to-end QA setting (Level-3). When grounding constraints are imposed, performance drops sharply: no model exceeds 1% accuracy when both a correct answer and accurate spatio-temporal localization are required (Level-5), and most models fail to produce any correctly grounded prediction. These results expose a significant gap between surface-level answer correctness and genuine evidence-based reasoning, and show that grounded video understanding remains a bottleneck for long-video QA. We further analyze performance across minimal evidence spans, atomic abilities, and inference paradigms, providing insights for future research in grounded video reasoning. The benchmark and code will be made publicly available.
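To make the difference between answer-only scoring and fully grounded scoring concrete, the following is a minimal sketch, not the benchmark's released code. The abstract does not state the matching metric or thresholds, so the use of intersection-over-union (IoU), the 0.5 thresholds, the exact-match answer comparison, and all names (`Prediction`, `answer_only_correct`, `fully_grounded_correct`) are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Tuple

Interval = Tuple[float, float]             # (start_sec, end_sec)
Box = Tuple[float, float, float, float]    # (x1, y1, x2, y2)


def temporal_iou(a: Interval, b: Interval) -> float:
    """IoU of two time intervals; 0.0 if they do not overlap."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def box_iou(a: Box, b: Box) -> float:
    """IoU of two axis-aligned boxes; 0.0 if they do not overlap."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


@dataclass
class Prediction:
    # Hypothetical record: an answer plus the evidence the model points to.
    answer: str
    interval: Interval
    box: Box


def answer_only_correct(pred: Prediction, gold: Prediction) -> bool:
    """Answer-only check (assumed: case-insensitive exact match)."""
    return pred.answer.strip().lower() == gold.answer.strip().lower()


def fully_grounded_correct(pred: Prediction, gold: Prediction,
                           t_thresh: float = 0.5,
                           s_thresh: float = 0.5) -> bool:
    """Strictest check: correct answer AND temporal AND spatial evidence.

    Both thresholds are assumed values, not taken from the paper.
    """
    return (answer_only_correct(pred, gold)
            and temporal_iou(pred.interval, gold.interval) >= t_thresh
            and box_iou(pred.box, gold.box) >= s_thresh)
```

Under these assumptions, a prediction with the right answer but a time interval far from the annotated one passes the answer-only check and fails the grounded check, which is the kind of gap the abstract reports between Level-3 and Level-5.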
