Benchmarking Visual State Tracking in Multimodal Video Understanding

Abstract

Understanding a video requires more than recognizing isolated moments: humans continuously track entities, states, and events as they evolve over time. This capacity for visual state tracking is fundamental to video understanding, yet it remains underexplored in current evaluations of Multimodal Large Language Models (MLLMs). We introduce the Visual STAte Tracking benchmark (VSTAT), a video-based benchmark designed to diagnose visual state tracking in MLLMs. VSTAT consists of 834 clips drawn from both synthetic and real-world videos, paired with 1,500 questions that cannot be answered from any single frame or short segment and instead require continuous perception and integration of events across the entire video stream. We find that state-of-the-art MLLMs, despite their strong performance on existing video benchmarks, perform far below humans and only modestly above answer-prior baselines. To analyze this gap, we compare MLLMs' thinking traces with the underlying video stream to understand why and when MLLMs fail on VSTAT. We find that MLLMs reason and track correctly in text, but fail to visually perceive the events they need to track. Finally, our preliminary evaluation suggests that recent agentic approaches, including MLLM-based video agents and coding agents, do not readily resolve these failures and still fall short on VSTAT.
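
To make the notion of an answer-prior baseline concrete, the following is a minimal sketch of one possible form: a predictor that never sees the video and always returns the most frequent answer for a given question type. The abstract does not specify how the VSTAT baselines are constructed, so the function names, the (question type, answer) data layout, and the toy data below are all illustrative assumptions.

```python
from collections import Counter, defaultdict


def fit_answer_prior(examples):
    """Learn the most frequent answer per question type.

    `examples` is a list of (question_type, answer) pairs. This layout is
    hypothetical and is not taken from the VSTAT release.
    """
    counts = defaultdict(Counter)
    for question_type, answer in examples:
        counts[question_type][answer] += 1
    # Keep only the single most common answer for each question type.
    return {qtype: c.most_common(1)[0][0] for qtype, c in counts.items()}


def prior_accuracy(prior, examples, fallback=None):
    """Accuracy of always predicting the per-type majority answer."""
    if not examples:
        return 0.0
    correct = sum(
        prior.get(question_type, fallback) == answer
        for question_type, answer in examples
    )
    return correct / len(examples)


# Toy data (invented for illustration only; not benchmark results).
train = [
    ("count", "2"), ("count", "2"), ("count", "3"),
    ("location", "left"), ("location", "left"), ("location", "right"),
]
test = [
    ("count", "2"), ("count", "4"),
    ("location", "left"), ("location", "right"),
]

prior = fit_answer_prior(train)
acc = prior_accuracy(prior, test)
print(prior, acc)
```

A model whose accuracy is only modestly above such a video-blind predictor is extracting little usable information from the video itself, which is why this kind of baseline is a useful reference point alongside human performance.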