MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities
August 1, 2024
Authors: Weihao Yu, Zhengyuan Yang, Linfeng Ren, Linjie Li, Jianfeng Wang, Kevin Lin, Chung-Ching Lin, Zicheng Liu, Lijuan Wang, Xinchao Wang
cs.AI
Abstract
MM-Vet, with open-ended vision-language questions aimed at evaluating integrated capabilities, has become one of the most popular benchmarks for large multimodal model evaluation. MM-Vet assesses six core vision-language (VL) capabilities: recognition, knowledge, spatial awareness, language generation, OCR, and math. However, its question format is restricted to single image-text pairs, lacking the interleaved image and text sequences prevalent in real-world scenarios. To address this limitation, we introduce MM-Vet v2, which adds a new VL capability called "image-text sequence understanding" that evaluates models' ability to process VL sequences. Furthermore, we maintain the high quality of evaluation samples while further expanding the size of the evaluation set. Benchmarking large multimodal models with MM-Vet v2, we find that Claude 3.5 Sonnet is the best model with a score of 71.8, slightly outperforming GPT-4o, which scored 71.0. Among open-weight models, InternVL2-Llama3-76B leads with a score of 68.4.
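To make the "image-text sequence understanding" format concrete, the sketch below shows how an interleaved image-text question might be posed to a model through an OpenAI-style chat API. This is a minimal illustration under stated assumptions, not the paper's released evaluation harness: the image URLs, the question text, and the choice of client library are placeholders for demonstration.

```python
# Minimal sketch of posing an interleaved image-text question to an LMM.
# NOTE: illustrative only, not the MM-Vet v2 release code; the image URLs
# and question below are hypothetical placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# An interleaved sequence: text and images alternate, as in MM-Vet v2's
# "image-text sequence understanding" questions.
interleaved_question = [
    {"type": "text", "text": "Here is the first panel of a recipe:"},
    {"type": "image_url", "image_url": {"url": "https://example.com/step1.jpg"}},
    {"type": "text", "text": "And here is the second panel:"},
    {"type": "image_url", "image_url": {"url": "https://example.com/step2.jpg"}},
    {"type": "text", "text": "What changed between the two steps, and why?"},
]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": interleaved_question}],
)
print(response.choices[0].message.content)  # open-ended answer to be scored
```

Because the questions are open-ended, model answers are scored against reference answers by an LLM grader rather than by exact match, following the protocol of the original MM-Vet.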