

VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning?

May 29, 2025
作者: Yuanxin Liu, Kun Ouyang, Haoning Wu, Yi Liu, Lin Sui, Xinhao Li, Yan Zhong, Y. Charles, Xinyu Zhou, Xu Sun
cs.AI

Abstract

Recent studies have shown that long chain-of-thought (CoT) reasoning can significantly enhance the performance of large language models (LLMs) on complex tasks. However, this benefit has yet to be demonstrated in the domain of video understanding, since most existing benchmarks lack the reasoning depth required to showcase the advantages of extended CoT chains. While recent efforts have proposed benchmarks aimed at video reasoning, their tasks are often knowledge-driven and do not rely heavily on visual content. To bridge this gap, we introduce VideoReasonBench, a benchmark designed to evaluate vision-centric, complex video reasoning. To ensure visual richness and high reasoning complexity, each video in VideoReasonBench depicts a sequence of fine-grained operations on a latent state that is visible only in part of the video. The questions evaluate three escalating levels of video reasoning skill: recalling observed visual information, inferring the content of latent states, and predicting information beyond the video. Under this task setting, models must precisely recall multiple operations in the video and perform step-by-step reasoning to arrive at the correct final answers. Using VideoReasonBench, we comprehensively evaluate 18 state-of-the-art multimodal LLMs (MLLMs) and find that most perform poorly on complex video reasoning: GPT-4o achieves only 6.9% accuracy, while the thinking-enhanced Gemini-2.5-Pro significantly outperforms the others with 56.0% accuracy. Our investigation of "test-time scaling" further reveals that an extended thinking budget, while offering little or no benefit on existing video benchmarks, is essential for improving performance on VideoReasonBench.
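To make the task structure concrete, the sketch below illustrates the kind of latent-state episode the abstract describes: a state that is visible only at the start, a sequence of fine-grained operations shown afterward, and questions at the three escalating skill levels. This is a minimal, hypothetical Python example; the "hidden counter" state and all names are illustrative assumptions, not the benchmark's actual tasks or code.

```python
# Hypothetical illustration of the VideoReasonBench task setup described above.
# The latent state here is a hidden counter shown only at the start of the
# video; the model must track every subsequent operation to answer correctly.
from dataclasses import dataclass


@dataclass
class Episode:
    initial_state: int        # latent state, visible only in part of the video
    operations: list[int]     # fine-grained operations shown one by one (+k / -k)

    def final_state(self) -> int:
        # Inferring the latent state requires replaying every operation in order;
        # skipping or misreading even one yields a wrong answer.
        state = self.initial_state
        for op in self.operations:
            state += op
        return state

    def questions(self) -> dict[str, str]:
        return {
            # Level 1: recall observed visual information.
            "recall": f"How many operations were shown? -> {len(self.operations)}",
            # Level 2: infer the content of the latent state.
            "infer": f"What is the final value of the hidden counter? -> {self.final_state()}",
            # Level 3: predict information beyond the video.
            "predict": f"What would the counter be after one more +2? -> {self.final_state() + 2}",
        }


ex = Episode(initial_state=5, operations=[+3, -1, +4, -2])
for level, q in ex.questions().items():
    print(level, "|", q)
```

Even in this toy form, the answer to the Level 2 and Level 3 questions cannot be read off any single frame; it only falls out of step-by-step reasoning over the full operation sequence, which is what makes extended CoT (and a larger thinking budget) pay off on this benchmark.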
