Towards Video Thinking Test: A Holistic Benchmark for Advanced Video Reasoning and Understanding

July 20, 2025
Authors: Yuanhan Zhang, Yunice Chew, Yuhao Dong, Aria Leo, Bo Hu, Ziwei Liu
cs.AI

Abstract

Human intelligence requires correctness and robustness, with the former being foundational for the latter. In video understanding, correctness ensures the accurate interpretation of visual content, and robustness maintains consistent performance in challenging conditions. Despite advances in video large language models (video LLMs), existing benchmarks inadequately reflect the gap between these models and human intelligence in maintaining correctness and robustness in video interpretation. We introduce the Video Thinking Test (Video-TT) to assess whether video LLMs can interpret real-world videos as effectively as humans. Video-TT reflects genuine gaps in understanding complex visual narratives and evaluates robustness against natural adversarial questions. Video-TT comprises 1,000 YouTube Shorts videos, each with one open-ended question and four adversarial questions that probe visual and narrative complexity. Our evaluation shows a significant gap between video LLMs and human performance.