

V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning

March 14, 2025
Authors: Zixu Cheng, Jian Hu, Ziquan Liu, Chenyang Si, Wei Li, Shaogang Gong
cs.AI

Abstract

Humans process video reasoning with a sequential spatio-temporal logic: we first identify the relevant frames ("when"), then analyse the spatial relationships ("where") between key objects, and finally leverage these relationships to draw inferences ("what"). However, can Video Large Language Models (Video-LLMs) also "reason through a sequential spatio-temporal logic" in videos? Existing Video-LLM benchmarks primarily focus on assessing object presence, neglecting relational reasoning. Consequently, it is difficult to measure whether a model truly comprehends object interactions (actions/events) in videos or merely relies on pre-trained "memory" of co-occurrences as biases in generating answers. In this work, we introduce a Video Spatio-Temporal Reasoning (V-STaR) benchmark to address these shortcomings. The key idea is to decompose video understanding into a Reverse Spatio-Temporal Reasoning (RSTR) task that simultaneously evaluates what objects are present, when events occur, and where they are located, while capturing the underlying Chain-of-Thought (CoT) logic. To support this evaluation, we construct a dataset designed to elicit the spatio-temporal reasoning process of Video-LLMs. It contains coarse-to-fine CoT questions generated by a semi-automated GPT-4-powered pipeline, embedding explicit reasoning chains to mimic human cognition. Experiments with 14 Video-LLMs on V-STaR reveal significant gaps between current Video-LLMs and the need for robust and consistent spatio-temporal reasoning.
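To make the coarse-to-fine RSTR question chain concrete, here is a minimal sketch in Python. The field names, question wording, and answer formats are illustrative assumptions, not the actual V-STaR schema: it only shows how one video could be paired with a "what" (presence), "when" (temporal segment), and "where" (bounding-box) question in the order the abstract describes.

```python
# Minimal sketch of an RSTR-style record (hypothetical schema, not the official V-STaR format).
# The coarse-to-fine ordering mirrors the abstract: "what" -> "when" -> "where".

from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class RSTRSample:
    """One video paired with a chain of what/when/where questions."""
    video_id: str
    # "what": is the queried object/interaction present in the video?
    what_question: str = ""
    what_answer: str = ""
    # "when": temporal grounding as a (start_sec, end_sec) segment.
    when_question: str = ""
    when_answer: Tuple[float, float] = (0.0, 0.0)
    # "where": spatial grounding as normalized boxes (x1, y1, x2, y2) in the grounded frames.
    where_question: str = ""
    where_answer: List[Tuple[float, float, float, float]] = field(default_factory=list)


# Illustrative record showing the reasoning chain a model would be probed on.
sample = RSTRSample(
    video_id="example_0001",
    what_question="Is a person handing a cup to another person in this video?",
    what_answer="yes",
    when_question="During which time span does the hand-over happen?",
    when_answer=(12.0, 15.5),
    where_question="Where is the cup while it is being handed over?",
    where_answer=[(0.42, 0.31, 0.55, 0.48)],
)

if __name__ == "__main__":
    print(sample.what_question, "->", sample.what_answer)
    print(sample.when_question, "->", sample.when_answer)
    print(sample.where_question, "->", sample.where_answer)
```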
