MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding
June 20, 2024
Authors: Xinyu Fang, Kangrui Mao, Haodong Duan, Xiangyu Zhao, Yining Li, Dahua Lin, Kai Chen
cs.AI
Abstract
The advent of large vision-language models (LVLMs) has spurred research into
their applications in multi-modal contexts, particularly in video
understanding. Traditional VideoQA benchmarks, despite providing quantitative
metrics, often fail to encompass the full spectrum of video content and
inadequately assess models' temporal comprehension. To address these
limitations, we introduce MMBench-Video, a quantitative benchmark designed to
rigorously evaluate LVLMs' proficiency in video understanding. MMBench-Video
incorporates lengthy videos from YouTube and employs free-form questions,
mirroring practical use cases. The benchmark is meticulously crafted to probe
the models' temporal reasoning skills, with all questions human-annotated
according to a carefully constructed ability taxonomy. We employ GPT-4 for
automated assessment, demonstrating superior accuracy and robustness over
earlier LLM-based evaluations. Utilizing MMBench-Video, we have conducted
comprehensive evaluations that include both proprietary and open-source LVLMs
for images and videos. MMBench-Video stands as a valuable resource for the
research community, facilitating improved evaluation of LVLMs and catalyzing
progress in the field of video understanding. The evaluation code of
MMBench-Video will be integrated into VLMEvalKit:
https://github.com/open-compass/VLMEvalKit.