V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning
March 14, 2025
Authors: Zixu Cheng, Jian Hu, Ziquan Liu, Chenyang Si, Wei Li, Shaogang Gong
cs.AI
Abstract
Humans process video reasoning through a sequential spatio-temporal logic: we first identify the relevant frames ("when"), then analyse the spatial relationships ("where") between key objects, and finally leverage these relationships to draw inferences ("what"). However, can Video Large Language Models (Video-LLMs) also "reason through a sequential spatio-temporal logic" in videos? Existing Video-LLM benchmarks primarily focus on assessing object presence, neglecting relational reasoning. Consequently, it is difficult to measure whether a model truly comprehends object interactions (actions/events) in videos or merely relies on pre-trained "memory" of co-occurrences as biases in generating answers. In this work, we introduce a Video Spatio-Temporal Reasoning (V-STaR) benchmark to address these shortcomings. The key idea is to decompose video understanding into a Reverse Spatio-Temporal Reasoning (RSTR) task that simultaneously evaluates what objects are present, when events occur, and where they are located, while capturing the underlying Chain-of-Thought (CoT) logic. To support this evaluation, we construct a dataset to elicit the spatio-temporal reasoning process of Video-LLMs. It contains coarse-to-fine CoT questions generated by a semi-automated GPT-4-powered pipeline, embedding explicit reasoning chains to mimic human cognition. Experiments on 14 Video-LLMs with our V-STaR reveal significant gaps between current Video-LLMs and the requirements of robust and consistent spatio-temporal reasoning.
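To make the coarse-to-fine question structure concrete, below is a minimal, hypothetical sketch of what one RSTR-style sample and its "what → when → where" chain of sub-questions could look like. The class and field names (`RSTRSample`, `when_answer_sec`, `where_answer_box`, etc.) and the example values are illustrative assumptions, not the actual V-STaR data schema; the point is only that each sample pairs a coarse event question with finer temporal and spatial grounding questions whose answers should be mutually consistent.

```python
# Hypothetical illustration of a coarse-to-fine RSTR question chain.
# Field names and values are assumptions for exposition, not the V-STaR schema.
from dataclasses import dataclass
from typing import Tuple


@dataclass
class RSTRSample:
    video_id: str
    # Coarse "what" question about an object interaction (action/event).
    what_question: str
    what_answer: str
    # Finer "when" question: temporal grounding of the event, in seconds.
    when_question: str
    when_answer_sec: Tuple[float, float]
    # Finest "where" question: spatial grounding as a normalized [x1, y1, x2, y2] box.
    where_question: str
    where_answer_box: Tuple[float, float, float, float]


sample = RSTRSample(
    video_id="demo_0001",
    what_question="What is the person doing with the cup?",
    what_answer="Picking it up from the table.",
    when_question="When does the person pick up the cup?",
    when_answer_sec=(12.0, 14.5),
    where_question="Where is the cup when it is picked up?",
    where_answer_box=(0.42, 0.55, 0.58, 0.78),
)

# Querying a Video-LLM in this order probes whether its event answer ("what"),
# temporal localization ("when"), and spatial grounding ("where") agree with
# each other, rather than testing object presence alone.
chained_prompt = "\n".join(
    [sample.what_question, sample.when_question, sample.where_question]
)
print(chained_prompt)
```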