Benchmarking Visual State Tracking in Multimodal Video Understanding

Abstract

Understanding a video requires more than recognizing isolated moments, because humans continuously track entities, states, and events over time. This capacity for visual state tracking is fundamental to video understanding, yet it remains underexplored in current evaluations of Multimodal Large Language Models (MLLMs). We introduce the Visual STAte Tracking benchmark (VSTAT), a video-based benchmark designed to diagnose visual state tracking in MLLMs. VSTAT consists of 834 clips drawn from both synthetic and real-world videos, paired with 1,500 questions that cannot be answered from any single frame or short segment and instead require continuous perception and integration of events across the entire video stream. Although state-of-the-art MLLMs perform strongly on existing video benchmarks, we find that on VSTAT they perform far below humans and only modestly above answer-prior baselines. To analyze this gap, we compare MLLMs' thinking traces with the underlying video stream to understand why and when MLLMs fail on VSTAT. We find that MLLMs reason and track correctly in text, but fail to visually perceive the events they need to track. Finally, our preliminary evaluation suggests that recent agentic approaches, including MLLM-based video agents and coding agents, do not readily resolve these failures and still fall short on VSTAT.
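
For readers unfamiliar with the term, an answer-prior baseline can be illustrated with a minimal sketch. The abstract does not specify how the baselines are constructed, so the version below is an assumption: it always predicts the single most frequent answer in a reference set and never looks at the video or the question. The function name `answer_prior_accuracy` and the toy answer lists are hypothetical and are not drawn from VSTAT.

```python
from collections import Counter


def answer_prior_accuracy(reference_answers, test_answers):
    """Accuracy of always predicting the most frequent reference answer.

    Assumed form of an answer-prior baseline: no video frames and no
    question text are consulted, only the distribution of answers.
    """
    if not reference_answers or not test_answers:
        raise ValueError("both answer lists must be non-empty")
    # The "prior" is simply the modal answer in the reference set.
    prior_answer, _ = Counter(reference_answers).most_common(1)[0]
    correct = sum(1 for answer in test_answers if answer == prior_answer)
    return correct / len(test_answers)


# Hypothetical toy answers (e.g. object counts), not real benchmark data.
reference = ["2", "3", "2", "1", "2"]
test = ["2", "3", "2", "4"]
baseline = answer_prior_accuracy(reference, test)
```

A model whose accuracy sits only modestly above such a number is extracting little from the video beyond what the answer distribution already gives away, which is the comparison the abstract draws.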