VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large Video Language Models
May 13, 2025
作者: Pritam Sarkar, Ali Etemad
cs.AI
Abstract
Despite recent advances in video understanding, the capabilities of Large
Video Language Models (LVLMs) to perform video-based causal reasoning remain
underexplored, largely due to the absence of relevant and dedicated benchmarks
for evaluating causal reasoning in visually grounded and goal-driven settings.
To fill this gap, we introduce a novel benchmark named Video-based long-form
Causal Reasoning (VCRBench). We create VCRBench using procedural videos of
simple everyday activities, where the steps are deliberately shuffled with each
clip capturing a key causal event, to test whether LVLMs can identify, reason
about, and correctly sequence the events needed to accomplish a specific goal.
Moreover, the benchmark is carefully designed to prevent LVLMs from exploiting
linguistic shortcuts, as seen in multiple-choice or binary QA formats, while
also avoiding the challenges associated with evaluating open-ended QA. Our
evaluation of state-of-the-art LVLMs on VCRBench suggests that these models
struggle with video-based long-form causal reasoning, primarily due to their
difficulty in modeling long-range causal dependencies directly from visual
observations. As a simple step toward enabling such capabilities, we propose
Recognition-Reasoning Decomposition (RRD), a modular approach that breaks
video-based causal reasoning into two sub-tasks of video recognition and causal
reasoning. Our experiments show that RRD significantly boosts accuracy on
VCRBench, with gains of up to 25.2%. Finally, our thorough analysis
reveals interesting insights, for instance, that LVLMs primarily rely on
language knowledge for complex video-based long-form causal reasoning tasks.