V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning
March 14, 2025
Authors: Zixu Cheng, Jian Hu, Ziquan Liu, Chenyang Si, Wei Li, Shaogang Gong
cs.AI
Abstract
Humans process video reasoning with a sequential spatio-temporal logic: we first identify the relevant frames ("when"), then analyse the spatial relationships ("where") between key objects, and finally leverage these relationships to draw inferences ("what"). However, can Video Large Language Models (Video-LLMs) also "reason through a sequential spatio-temporal logic" in videos? Existing Video-LLM benchmarks primarily focus on assessing object presence while neglecting relational reasoning. Consequently, it is difficult to measure whether a model truly comprehends object interactions (actions/events) in videos or merely relies on pre-trained "memory" of co-occurrences as a bias when generating answers. In this work, we introduce the Video Spatio-Temporal Reasoning (V-STaR) benchmark to address these shortcomings. The key idea is to decompose video understanding into a Reverse Spatio-Temporal Reasoning (RSTR) task that simultaneously evaluates what objects are present, when events occur, and where they are located, while capturing the underlying chain-of-thought (CoT) logic. To support this evaluation, we construct a dataset designed to elicit the spatio-temporal reasoning process of Video-LLMs. It contains coarse-to-fine CoT questions generated by a semi-automated GPT-4-powered pipeline, embedding explicit reasoning chains to mimic human cognition. Experiments with 14 Video-LLMs on V-STaR reveal a significant gap between current Video-LLMs and the need for robust and consistent spatio-temporal reasoning.