Towards Video Thinking Test: A Holistic Benchmark for Advanced Video Reasoning and Understanding

Abstract

Human intelligence requires both correctness and robustness, with the former being the foundation of the latter. In video understanding, correctness ensures that visual content is interpreted accurately, while robustness keeps performance consistent under challenging conditions. Despite advances in video large language models (video LLMs), existing benchmarks do not adequately reflect the gap between these models and human intelligence in maintaining correctness and robustness in video interpretation. We introduce the Video Thinking Test (Video-TT) to assess whether video LLMs can interpret real-world videos as effectively as humans. Video-TT reflects genuine gaps in understanding complex visual narratives and evaluates robustness against natural adversarial questions. It comprises 1,000 YouTube Shorts videos, each paired with one open-ended question and four adversarial questions that probe visual and narrative complexity. Our evaluation shows a significant gap between video LLMs and human performance.