Breaking Down Video LLM Benchmarks: Knowledge, Spatial Perception, or True Temporal Understanding?
May 20, 2025
Authors: Bo Feng, Zhengfeng Lai, Shiyu Li, Zizhen Wang, Simon Wang, Ping Huang, Meng Cao
cs.AI
Abstract
Existing video understanding benchmarks often conflate knowledge-based and
purely image-based questions, rather than clearly isolating a model's temporal
reasoning ability, which is the key aspect that distinguishes video
understanding from other modalities. We identify two major limitations that
obscure whether higher scores truly indicate stronger understanding of the
dynamic content in videos: (1) strong language priors, where models can answer
questions without watching the video; and (2) shuffling invariance, where
models maintain similar performance on certain questions even when video frames
are temporally shuffled. To alleviate these issues, we propose VBenchComp, an
automated pipeline that categorizes questions into different domains:
LLM-Answerable, Semantic, and Temporal. Specifically, LLM-Answerable questions
can be answered without viewing the video; Semantic questions remain answerable
even when the video frames are shuffled; and Temporal questions require
understanding the correct temporal order of frames. The remaining questions
are labeled as Others. This categorization enables fine-grained evaluation of
the different capabilities of a video LLM. Our analysis reveals nuanced model
weaknesses that
are hidden by traditional overall scores, and we offer insights and
recommendations for designing future benchmarks that more accurately assess
video LLMs.
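The categorization described in the abstract can be read as a three-stage decision procedure. Below is a minimal Python sketch of that logic, under assumptions not stated in the abstract: the `answers_correctly` helper is a hypothetical stand-in for querying a video LLM (or, when given no frames, its underlying language model) and scoring the answer against ground truth. The paper's actual automated pipeline may differ in its details.

```python
import random
from typing import Callable, Optional, Sequence

# Hypothetical helper type: given a question and an optional frame sequence,
# query the model and return whether its answer matches the ground truth.
# This is a stand-in; the paper does not specify the pipeline at this level.
AnswerFn = Callable[[str, Optional[Sequence]], bool]


def categorize_question(question: str,
                        frames: Sequence,
                        answers_correctly: AnswerFn) -> str:
    """Assign a question to one of the four VBenchComp categories."""
    # (1) LLM-Answerable: a strong language prior lets the model answer
    #     correctly without seeing any video frames at all.
    if answers_correctly(question, None):
        return "LLM-Answerable"

    # (2) Semantic: the answer survives temporal shuffling of the frames,
    #     so no frame-ordering information is actually required.
    shuffled = random.sample(list(frames), k=len(frames))
    if answers_correctly(question, shuffled):
        return "Semantic"

    # (3) Temporal: answerable only when frames appear in their correct order.
    if answers_correctly(question, frames):
        return "Temporal"

    # (4) Others: not answered correctly under any of the conditions above.
    return "Others"
```

Read in this order, a question counts as Temporal only if the model fails without the video, fails on shuffled frames, and succeeds on correctly ordered frames, which is precisely the property that separates temporal reasoning from language priors and per-frame semantics.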