VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large Video Language Models

Abstract

Despite recent advances in video understanding, the capability of Large Video Language Models (LVLMs) to perform video-based causal reasoning remains underexplored, largely due to the absence of relevant and dedicated benchmarks for evaluating causal reasoning in visually grounded and goal-driven settings. To fill this gap, we introduce a novel benchmark named Video-based long-form Causal Reasoning (VCRBench). We create VCRBench using procedural videos of simple everyday activities, where the steps are deliberately shuffled and each clip captures a key causal event, to test whether LVLMs can identify, reason about, and correctly sequence the events needed to accomplish a specific goal. Moreover, the benchmark is carefully designed to prevent LVLMs from exploiting linguistic shortcuts, as seen in multiple-choice or binary QA formats, while also avoiding the challenges associated with evaluating open-ended QA. Our evaluation of state-of-the-art LVLMs on VCRBench suggests that these models struggle with video-based long-form causal reasoning, primarily because they have difficulty modeling long-range causal dependencies directly from visual observations. As a simple step toward enabling such capabilities, we propose Recognition-Reasoning Decomposition (RRD), a modular approach that breaks video-based causal reasoning into two sub-tasks: video recognition and causal reasoning. Our experiments show that RRD significantly boosts accuracy on VCRBench, with gains of up to 25.2%. Finally, our thorough analysis reveals interesting insights, for instance, that LVLMs primarily rely on language knowledge for complex video-based long-form causal reasoning tasks.
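To make the task format concrete, the following is a minimal sketch of how a shuffled-clip item might be built and scored, and how a two-stage recognition-then-reasoning pipeline could be wired together. It is an illustration under stated assumptions, not the benchmark's actual code: the names `ShuffledItem`, `make_item`, `exact_match_accuracy`, and `rrd_predict` are hypothetical, clips are represented by text labels rather than video, and exact-match scoring of the predicted order is assumed because the abstract only says "accuracy".

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class ShuffledItem:
    goal: str                 # e.g. "make tea"
    clips: List[str]          # clip labels in the shuffled order shown to the model
    correct_order: List[int]  # indices into `clips` that restore the causal order


def make_item(goal: str, ordered_steps: Sequence[str], seed: int = 0) -> ShuffledItem:
    """Shuffle the steps of a procedure and record the order that undoes the shuffle."""
    rng = random.Random(seed)
    perm = list(range(len(ordered_steps)))
    rng.shuffle(perm)
    # perm[j] is the original step index of the clip shown at position j.
    clips = [ordered_steps[i] for i in perm]
    # For each original step k, find the position where it was shown.
    correct_order = [perm.index(k) for k in range(len(ordered_steps))]
    return ShuffledItem(goal, clips, correct_order)


def exact_match_accuracy(predictions: Sequence[List[int]],
                         items: Sequence[ShuffledItem]) -> float:
    """Assumed metric: a prediction counts only if the whole sequence is correct."""
    hits = sum(pred == item.correct_order for pred, item in zip(predictions, items))
    return hits / len(items)


def rrd_predict(item: ShuffledItem,
                recognize: Callable[[str], str],
                reason: Callable[[str, List[str]], List[int]]) -> List[int]:
    """Two-stage sketch: describe each clip first, then order the descriptions.

    `recognize` and `reason` are placeholders for model calls (assumptions).
    """
    descriptions = [recognize(clip) for clip in item.clips]
    return reason(item.goal, descriptions)
```

In this sketch the model never picks from a list of candidate answers; it must output the full ordering, which is one way to read the abstract's point about avoiding multiple-choice shortcuts while keeping evaluation automatic.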
