Benchmarking Visual State Tracking in Multimodal Video Understanding

Abstract

Understanding a video requires more than recognizing isolated moments: humans continuously track entities, states, and events over time. This capacity for visual state tracking is fundamental to video understanding, yet it remains underexplored in current evaluations of Multimodal Large Language Models (MLLMs). We introduce the Visual STAte Tracking benchmark (VSTAT), a video benchmark designed to diagnose visual state tracking in MLLMs. VSTAT consists of 834 clips drawn from both synthetic and real-world videos, paired with 1,500 questions. These questions cannot be answered from any single frame or short segment; they require continuous perception and integration of events across the entire video stream. We find that state-of-the-art MLLMs, despite their strong performance on existing video benchmarks, perform far below humans and only modestly above answer-prior baselines. To analyze this gap, we compare the MLLMs' thinking traces with the underlying video stream to understand why and when they fail on VSTAT. We find that MLLMs reason and track correctly in text but fail to visually perceive the events they need to track. Finally, our preliminary evaluation suggests that recent agentic approaches, including MLLM-based video agents and coding agents, do not readily resolve these failures and still fall short on VSTAT.