VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large Video Language Models
May 13, 2025
作者: Pritam Sarkar, Ali Etemad
cs.AI
Abstract
Despite recent advances in video understanding, the capabilities of Large
Video Language Models (LVLMs) to perform video-based causal reasoning remain
underexplored, largely due to the absence of relevant and dedicated benchmarks
for evaluating causal reasoning in visually grounded and goal-driven settings.
To fill this gap, we introduce a novel benchmark named Video-based long-form
Causal Reasoning (VCRBench). We create VCRBench using procedural videos of
simple everyday activities, where the steps are deliberately shuffled with each
clip capturing a key causal event, to test whether LVLMs can identify, reason
about, and correctly sequence the events needed to accomplish a specific goal.
Moreover, the benchmark is carefully designed to prevent LVLMs from exploiting
linguistic shortcuts, as seen in multiple-choice or binary QA formats, while
also avoiding the challenges associated with evaluating open-ended QA. Our
evaluation of state-of-the-art LVLMs on VCRBench suggests that these models
struggle with video-based long-form causal reasoning, primarily due to their
difficulty in modeling long-range causal dependencies directly from visual
observations. As a simple step toward enabling such capabilities, we propose
Recognition-Reasoning Decomposition (RRD), a modular approach that breaks
video-based causal reasoning into two sub-tasks: video recognition and causal
reasoning. Our experiments show that RRD significantly boosts accuracy on
VCRBench, with gains of up to 25.2%. Finally, our thorough analysis
reveals interesting insights, for instance, that LVLMs primarily rely on
language knowledge for complex video-based long-form causal reasoning tasks.
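The Recognition-Reasoning Decomposition described in the abstract can be sketched as a two-stage pipeline: a recognition stage converts each shuffled clip into a textual event description, and a reasoning stage orders those descriptions causally. This is a minimal illustration under assumed interfaces; the function names (`recognize_clip`, `reason_order`) and signatures are hypothetical, not the authors' implementation.

```python
# Hypothetical sketch of Recognition-Reasoning Decomposition (RRD).
# Stage 1 (video recognition): describe each clip independently as text.
# Stage 2 (causal reasoning): a language model orders the descriptions
# so the steps accomplish the goal. Both stage callables are placeholders.

from typing import Callable, List


def rrd_order(clips: List[str],
              recognize_clip: Callable[[str], str],
              reason_order: Callable[[List[str]], List[int]]) -> List[int]:
    """Return a causal ordering of `clips` as a list of indices."""
    # Stage 1: recognize each shuffled clip on its own.
    descriptions = [recognize_clip(clip) for clip in clips]
    # Stage 2: reason over the text descriptions to recover the causal order.
    return reason_order(descriptions)
```

Decoupling the two stages means the reasoning model never sees pixels, which matches the paper's observation that LVLMs handle the linguistic reasoning better than modeling long-range causal dependencies directly from visual input.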