Breaking Down Video LLM Benchmarks: Knowledge, Spatial Perception, or True Temporal Understanding?

Abstract

Existing video understanding benchmarks often conflate knowledge-based and purely image-based questions, rather than clearly isolating a model's temporal reasoning ability, which is the key aspect that distinguishes video understanding from other modalities. We identify two major limitations that obscure whether higher scores truly indicate stronger understanding of the dynamic content in videos: (1) strong language priors, where models can answer questions without watching the video; and (2) shuffling invariance, where models maintain similar performance on certain questions even when video frames are temporally shuffled. To alleviate these issues, we propose VBenchComp, an automated pipeline that categorizes questions into different domains: LLM-Answerable, Semantic, and Temporal. Specifically, LLM-Answerable questions can be answered without viewing the video; Semantic questions remain answerable even when the video frames are shuffled; and Temporal questions require understanding the correct temporal order of frames. The rest of the questions are labeled as Others. This enables fine-grained evaluation of the different capabilities of a video LLM. Our analysis reveals nuanced model weaknesses that are hidden by traditional overall scores, and we offer insights and recommendations for designing future benchmarks that more accurately assess video LLMs.
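
The four categories can be illustrated with a small sketch. It assumes that, for each question, we already know whether a model answers correctly in three settings: with no video, with shuffled frames, and with frames in the original order. The `Outcome` type, the function names, and the exact decision rule are hypothetical and chosen for illustration; the abstract does not specify how VBenchComp aggregates model responses, so this should not be read as the authors' implementation.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """Hypothetical per-question correctness flags (an assumption, not from the paper)."""
    text_only: bool  # answered correctly with no video input
    shuffled: bool   # answered correctly with temporally shuffled frames
    ordered: bool    # answered correctly with frames in original order


def categorize(o: Outcome) -> str:
    """Assign one of the four categories using an assumed decision cascade."""
    if o.text_only:
        # Solvable from language priors alone.
        return "LLM-Answerable"
    if o.ordered and o.shuffled:
        # Frame content suffices; frame order does not matter.
        return "Semantic"
    if o.ordered and not o.shuffled:
        # Only solvable when temporal order is preserved.
        return "Temporal"
    return "Others"


def per_category_accuracy(records):
    """Compute accuracy per category from (category, is_correct) pairs."""
    totals, hits = Counter(), Counter()
    for category, is_correct in records:
        totals[category] += 1
        hits[category] += int(is_correct)
    return {c: hits[c] / totals[c] for c in totals}
```

Reporting accuracy per category rather than a single overall score is what lets a weakness on Temporal questions stay visible even when LLM-Answerable and Semantic questions dominate a benchmark.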
