
VideoZeroBench: Probing the Limits of Video MLLMs with Spatio-Temporal Evidence Verification

April 2, 2026
Authors: Jiahao Meng, Tan Yue, Qi Xu, Haochen Wang, Zhongwei Ren, Weisong Liu, Yuhao Wang, Renrui Zhang, Yunhai Tong, Haodong Duan
cs.AI

Abstract

Recent video multimodal large language models achieve impressive results across various benchmarks. However, current evaluations suffer from two critical limitations: (1) inflated scores can mask deficiencies in fine-grained visual understanding and reasoning, and (2) answer correctness is often measured without verifying whether models identify the precise spatio-temporal evidence supporting their predictions. To address this, we present VideoZeroBench, a hierarchical benchmark for challenging long-video question answering that rigorously verifies spatio-temporal evidence. It comprises 500 manually annotated questions across 13 domains, each paired with temporal intervals and spatial bounding boxes as evidence. To disentangle answer generation, temporal grounding, and spatial grounding, we introduce a five-level evaluation protocol that progressively tightens evidence requirements. Experiments show that even Gemini-3-Pro correctly answers fewer than 17% of questions under the standard end-to-end QA setting (Level-3). When grounding constraints are imposed, performance drops sharply: no model exceeds 1% accuracy when both a correct answer and accurate spatio-temporal localization are required (Level-5), and most fail to produce any correctly grounded prediction. These results expose a significant gap between surface-level answer correctness and genuine evidence-based reasoning, revealing that grounded video understanding remains a bottleneck for long-video QA. We further analyze performance across minimal evidence spans, atomic abilities, and inference paradigms, providing insights for future research in grounded video reasoning. The benchmark and code will be made publicly available.
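To make the evidence-verification idea concrete, the strictest level (Level-5) can be thought of as requiring a correct answer plus a predicted temporal interval and spatial bounding box that each overlap the annotated evidence. The sketch below is a minimal, hypothetical scoring routine under that reading; the function names, IoU thresholds (0.5), and the exact matching criteria are assumptions for illustration, not the paper's actual protocol.

```python
def temporal_iou(pred, gt):
    """IoU of two time intervals (start_s, end_s)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def level5_correct(answer_ok, pred_span, gt_span, pred_box, gt_box,
                   t_thresh=0.5, s_thresh=0.5):
    """Hypothetical Level-5 check: answer, temporal grounding, and
    spatial grounding must all pass (thresholds are illustrative)."""
    return (answer_ok
            and temporal_iou(pred_span, gt_span) >= t_thresh
            and box_iou(pred_box, gt_box) >= s_thresh)
```

Under such a conjunctive criterion, a single weak component (e.g. an imprecise bounding box) zeroes out an otherwise correct response, which is consistent with the sharp performance drop the abstract reports between Level-3 and Level-5.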