SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension
April 25, 2024
Authors: Bohao Li, Yuying Ge, Yi Chen, Yixiao Ge, Ruimao Zhang, Ying Shan
cs.AI
Abstract
Comprehending text-rich visual content is paramount for the practical
application of Multimodal Large Language Models (MLLMs), since text-rich
scenarios, characterized by extensive text embedded within images, are
ubiquitous in the real world. Recently, the advent of
MLLMs with impressive versatility has raised the bar for what we can expect
from MLLMs. However, their proficiency in text-rich scenarios has yet to be
comprehensively and objectively assessed, since current MLLM benchmarks
primarily focus on evaluating general visual comprehension. In this work, we
introduce SEED-Bench-2-Plus, a benchmark specifically designed for evaluating
text-rich visual comprehension of MLLMs. Our benchmark comprises 2.3K
multiple-choice questions with precise human annotations, spanning three broad
categories: Charts, Maps, and Webs, each of which covers a wide spectrum of
text-rich scenarios in the real world. These categories, due to their inherent
complexity and diversity, effectively simulate real-world text-rich
environments. We further conduct a thorough evaluation involving 34 prominent
MLLMs (including GPT-4V, Gemini-Pro-Vision, and Claude-3-Opus) and emphasize the
current limitations of MLLMs in text-rich visual comprehension. We hope that
our work can serve as a valuable addition to existing MLLM benchmarks,
providing insightful observations and inspiring further research in the area of
text-rich visual comprehension with MLLMs. The dataset and evaluation code can
be accessed at https://github.com/AILab-CVC/SEED-Bench.
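To give a concrete sense of how a multiple-choice benchmark like this is typically consumed, below is a minimal, hypothetical Python sketch of an accuracy harness. The input file name (seed_bench_2_plus.json) and the field names (question, choices, answer, category) are illustrative assumptions, not the repository's actual schema; consult the SEED-Bench repository linked above for the real data format and evaluation scripts.

    # Minimal, hypothetical sketch of a multiple-choice accuracy harness.
    # The input file name and field names are illustrative assumptions; see
    # the official SEED-Bench repository for the actual format and scripts.
    import json
    from collections import defaultdict

    def evaluate(pred_fn, questions):
        """Score pred_fn, which maps (question, choices) to the index of the
        chosen option, and return per-category accuracy."""
        correct, total = defaultdict(int), defaultdict(int)
        for q in questions:
            total[q["category"]] += 1
            if pred_fn(q["question"], q["choices"]) == q["answer"]:
                correct[q["category"]] += 1
        return {cat: correct[cat] / total[cat] for cat in total}

    if __name__ == "__main__":
        with open("seed_bench_2_plus.json") as f:  # assumed file name
            questions = json.load(f)
        # Trivial baseline: always pick the first option.
        scores = evaluate(lambda question, choices: 0, questions)
        for cat, acc in sorted(scores.items()):
            print(f"{cat}: {acc:.3f}")

In practice, the lambda baseline would be replaced by a call into an MLLM, for example prompting it with the question image and option letters and mapping the returned letter back to an index, which yields the per-category accuracy comparison described in the abstract.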