Towards Video Thinking Test: A Holistic Benchmark for Advanced Video Reasoning and Understanding

Abstract

Human intelligence requires both correctness and robustness, with the former being foundational for the latter. In video understanding, correctness ensures the accurate interpretation of visual content, while robustness maintains consistent performance under challenging conditions. Despite advances in video large language models (video LLMs), existing benchmarks inadequately reflect the gap between these models and human intelligence in maintaining correctness and robustness in video interpretation. We introduce the Video Thinking Test (Video-TT) to assess whether video LLMs can interpret real-world videos as effectively as humans. Video-TT reflects genuine gaps in understanding complex visual narratives and evaluates robustness against natural adversarial questions. It comprises 1,000 YouTube Shorts videos, each paired with one open-ended question and four adversarial questions that probe visual and narrative complexity. Our evaluation shows a significant gap between video LLM and human performance.

