Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
May 31, 2024
Authors: Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, Peixian Chen, Yanwei Li, Shaohui Lin, Sirui Zhao, Ke Li, Tong Xu, Xiawu Zheng, Enhong Chen, Rongrong Ji, Xing Sun
cs.AI
Abstract
In the quest for artificial general intelligence, Multi-modal Large Language
Models (MLLMs) have emerged as a focal point in recent advancements. However,
the predominant focus remains on developing their capabilities in static image
understanding. The potential of MLLMs in processing sequential visual data is
still insufficiently explored, highlighting the absence of a comprehensive,
high-quality assessment of their performance. In this paper, we introduce
Video-MME, the first-ever full-spectrum, Multi-Modal Evaluation benchmark of
MLLMs in Video analysis. Our work distinguishes itself from existing benchmarks
through four key features: 1) Diversity in video types, spanning 6 primary
visual domains with 30 subfields to ensure broad scenario generalizability; 2)
Duration in temporal dimension, encompassing short-, medium-, and
long-term videos, ranging from 11 seconds to 1 hour, for robust contextual
dynamics; 3) Breadth in data modalities, integrating multi-modal inputs besides
video frames, including subtitles and audio, to unveil the all-round
capabilities of MLLMs; 4) Quality in annotations, utilizing rigorous manual
labeling by expert annotators to facilitate precise and reliable model
assessment. We manually select 900 videos totaling 256 hours and annotate
them by repeatedly viewing all the video content, yielding 2,700
question-answer pairs. With Video-MME, we extensively evaluate various
state-of-the-art MLLMs, including the GPT-4 series and Gemini 1.5 Pro, as well as
open-source image models like InternVL-Chat-V1.5 and video models like
LLaVA-NeXT-Video. Our experiments reveal that Gemini 1.5 Pro is the
best-performing commercial model, significantly outperforming the open-source
models. Our dataset, along with these findings, underscores the need for further
improvements in handling longer sequences and multi-modal data. Project Page:
https://video-mme.github.io
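
As a rough illustration of how a benchmark with duration-stratified question-answer pairs like Video-MME might be consumed, the minimal sketch below loads a hypothetical JSON annotation file and computes per-duration and overall accuracy. The field names (duration, answer), the file name, the option-letter answer format, and the predict callable are illustrative assumptions, not the official Video-MME data format or toolkit; see https://video-mme.github.io for the actual data and evaluation scripts.

```python
"""Minimal evaluation sketch for Video-MME-style annotations (assumed schema)."""
import json
from collections import defaultdict


def evaluate(annotation_path: str, predict) -> dict:
    """Compute multiple-choice accuracy grouped by video duration category.

    `predict` is any callable mapping a QA entry (dict) to an option letter
    such as "A"/"B"/"C"/"D" (assumed answer format, not confirmed by the paper).
    """
    with open(annotation_path, "r", encoding="utf-8") as f:
        entries = json.load(f)  # assumed: a list of QA dicts

    correct = defaultdict(int)
    total = defaultdict(int)
    for entry in entries:
        duration = entry["duration"]  # assumed values: "short" / "medium" / "long"
        pred = predict(entry).strip().upper()[:1]
        total[duration] += 1
        if pred == entry["answer"].strip().upper()[:1]:
            correct[duration] += 1

    scores = {d: correct[d] / total[d] for d in total}
    scores["overall"] = sum(correct.values()) / max(sum(total.values()), 1)
    return scores


if __name__ == "__main__":
    # Trivial baseline: always answer "A"; replace with a real MLLM call.
    print(evaluate("videomme_annotations.json", lambda entry: "A"))
```

In practice, the `predict` callable would wrap an MLLM call that receives the sampled video frames (and, optionally, subtitles and audio) together with the question and candidate options, which is where the benchmark's duration and modality dimensions come into play.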