VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning?

Abstract

Recent studies have shown that long chain-of-thought (CoT) reasoning can significantly improve the performance of large language models (LLMs) on complex tasks. This benefit has yet to be demonstrated in video understanding, because most existing benchmarks lack the reasoning depth needed to show the advantage of extended CoT chains. Recent efforts have proposed benchmarks aimed at video reasoning, but their tasks are often knowledge-driven and do not rely heavily on visual content. To bridge this gap, we introduce VideoReasonBench, a benchmark designed to evaluate vision-centric, complex video reasoning. To ensure visual richness and high reasoning complexity, each video in VideoReasonBench depicts a sequence of fine-grained operations on a latent state that is visible in only part of the video. The questions evaluate three escalating levels of video reasoning skill: recalling observed visual information, inferring the content of latent states, and predicting information beyond the video. Under this task setting, models must precisely recall multiple operations in the video and reason step by step to reach the correct final answer. Using VideoReasonBench, we comprehensively evaluate 18 state-of-the-art multimodal LLMs (MLLMs) and find that most perform poorly on complex video reasoning: GPT-4o, for example, achieves only 6.9% accuracy, while the thinking-enhanced Gemini-2.5-Pro significantly outperforms the others with 56.0% accuracy. Our investigation of "test-time scaling" further reveals that an extended thinking budget, while offering little or no benefit on existing video benchmarks, is essential for improving performance on VideoReasonBench.
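The three skill levels can be made concrete with a toy example. The sketch below is only an illustration under stated assumptions: the swap operation, the three-slot state, and every function name are hypothetical and are not taken from the benchmark's actual tasks or code. It models a latent state shown only at the start, a sequence of observed operations, and one question per level.

```python
from dataclasses import dataclass


# Hypothetical toy task: these names are illustrative assumptions,
# not part of VideoReasonBench itself.
@dataclass(frozen=True)
class Swap:
    """One fine-grained operation: exchange the items at positions i and j."""
    i: int
    j: int


def apply_ops(state, ops):
    """Apply the swaps in order to a copy of state and return the result."""
    state = list(state)
    for op in ops:
        state[op.i], state[op.j] = state[op.j], state[op.i]
    return state


def recall_count(ops):
    """Level 1 (recall): how many operations were observed?"""
    return len(ops)


def infer_final_state(initial_state, ops):
    """Level 2 (infer): the state is visible only at the start,
    so the hidden end state must be deduced from the operations."""
    return apply_ops(initial_state, ops)


def predict_after(initial_state, ops, future_ops):
    """Level 3 (predict): the state after further operations
    that never appear in the video."""
    return apply_ops(infer_final_state(initial_state, ops), future_ops)


initial = ["A", "B", "C"]                          # visible only at the start
observed = [Swap(0, 1), Swap(1, 2), Swap(0, 2)]    # operations seen in the video
future = [Swap(0, 1)]                              # operations beyond the video

print(recall_count(observed))
print(infer_final_state(initial, observed))
print(predict_after(initial, observed, future))
```

Each level builds on the previous one: a single misremembered swap corrupts the inferred state and every prediction derived from it, which is why this kind of task rewards precise recall combined with step-by-step reasoning.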
