MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities

August 1, 2024
作者: Weihao Yu, Zhengyuan Yang, Linfeng Ren, Linjie Li, Jianfeng Wang, Kevin Lin, Chung-Ching Lin, Zicheng Liu, Lijuan Wang, Xinchao Wang
cs.AI

Abstract

MM-Vet, with open-ended vision-language questions designed to evaluate integrated capabilities, has become one of the most popular benchmarks for large multimodal model evaluation. MM-Vet assesses six core vision-language (VL) capabilities: recognition, knowledge, spatial awareness, language generation, OCR, and math. However, its question format is restricted to single image-text pairs, lacking the interleaved image and text sequences prevalent in real-world scenarios. To address this limitation, we introduce MM-Vet v2, which adds a new VL capability called "image-text sequence understanding" that evaluates a model's ability to process VL sequences. We also further expand the evaluation set while maintaining the high quality of its samples. Benchmarking large multimodal models with MM-Vet v2, we find that Claude 3.5 Sonnet is the best model with a score of 71.8, slightly outperforming GPT-4o, which scored 71.0. Among open-weight models, InternVL2-Llama3-76B leads with a score of 68.4.
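To make the new "image-text sequence understanding" capability concrete, the sketch below shows how an interleaved image-text question might be posed to a model via the OpenAI chat API. This is not the official MM-Vet v2 evaluation harness; the image URLs, the question, and the model choice are illustrative placeholders only.

```python
# Minimal sketch (assumed format, not the official MM-Vet v2 harness):
# an interleaved sequence of text and images is sent as a single user
# turn, instead of one image followed by one question.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical MM-Vet v2-style sample: text and images interleaved.
interleaved_content = [
    {"type": "text", "text": "Figure 1 shows the menu:"},
    {"type": "image_url", "image_url": {"url": "https://example.com/menu.jpg"}},
    {"type": "text", "text": "Figure 2 shows the receipt:"},
    {"type": "image_url", "image_url": {"url": "https://example.com/receipt.jpg"}},
    {"type": "text", "text": "Which ordered item is missing from the receipt?"},
]

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; any LMM that accepts interleaved input
    messages=[{"role": "user", "content": interleaved_content}],
)
print(response.choices[0].message.content)
```

Answering such a question requires the model to track references across the whole sequence (e.g., relating items in one image to entries in another), which a single image-text pair cannot test.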
