VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning?

May 29, 2025
作者: Yuanxin Liu, Kun Ouyang, Haoning Wu, Yi Liu, Lin Sui, Xinhao Li, Yan Zhong, Y. Charles, Xinyu Zhou, Xu Sun
cs.AI

Abstract

Recent studies have shown that long chain-of-thought (CoT) reasoning can significantly enhance the performance of large language models (LLMs) on complex tasks. However, this benefit has yet to be demonstrated in the domain of video understanding, since most existing benchmarks lack the reasoning depth needed to showcase the advantages of extended CoT chains. While recent efforts have proposed benchmarks aimed at video reasoning, the tasks are often knowledge-driven and do not rely heavily on visual content. To bridge this gap, we introduce VideoReasonBench, a benchmark designed to evaluate vision-centric, complex video reasoning. To ensure visual richness and high reasoning complexity, each video in VideoReasonBench depicts a sequence of fine-grained operations on a latent state that is visible only in part of the video. The questions evaluate three escalating levels of video reasoning skill: recalling observed visual information, inferring the content of latent states, and predicting information beyond the video. Under such a task setting, models must precisely recall the multiple operations shown in the video and perform step-by-step reasoning to arrive at the correct final answers. Using VideoReasonBench, we comprehensively evaluate 18 state-of-the-art multimodal LLMs (MLLMs) and find that most perform poorly on complex video reasoning; e.g., GPT-4o achieves only 6.9% accuracy, while the thinking-enhanced Gemini-2.5-Pro significantly outperforms the others with 56.0% accuracy. Our investigation of "test-time scaling" further reveals that an extended thinking budget, while offering little or no benefit on existing video benchmarks, is essential for improving performance on VideoReasonBench.
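To make the task setup concrete, below is a minimal sketch of how an evaluation over VideoReasonBench-style items might be scripted. This is not the authors' released harness: the dataset schema, the `model_answer` stub, and the exact-match scoring are all illustrative assumptions; the real benchmark may format questions and score answers differently.

```python
# Minimal sketch of an evaluation loop over benchmark items that pair a
# video with a question at one of the three reasoning levels named in the
# abstract (recall / infer / predict). All item contents are hypothetical.
from collections import defaultdict

ITEMS = [
    {"video": "demo_001.mp4", "level": "recall",
     "question": "Which operation was applied third?", "answer": "swap"},
    {"video": "demo_001.mp4", "level": "infer",
     "question": "What is the latent state after all operations?", "answer": "3,1,2"},
    {"video": "demo_001.mp4", "level": "predict",
     "question": "What would the state be after one more swap?", "answer": "1,3,2"},
]

def model_answer(video: str, question: str) -> str:
    """Stand-in for an MLLM call; replace with a real model API."""
    return "swap"  # placeholder prediction

def evaluate(items) -> None:
    # Exact-match accuracy, overall and per reasoning level (an assumption;
    # the benchmark's actual answer matching may be more forgiving).
    correct, per_level = 0, defaultdict(lambda: [0, 0])
    for item in items:
        pred = model_answer(item["video"], item["question"]).strip().lower()
        hit = pred == item["answer"].strip().lower()
        correct += hit
        per_level[item["level"]][0] += hit
        per_level[item["level"]][1] += 1
    print(f"overall accuracy: {correct / len(items):.1%}")
    for level, (c, n) in per_level.items():
        print(f"  {level}: {c}/{n}")

if __name__ == "__main__":
    evaluate(ITEMS)
```

Separating the accuracy by level mirrors the paper's framing: a model can succeed at recall while failing at inference or prediction, so per-level scores expose where the reasoning chain breaks down.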
