SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension

April 25, 2024
Authors: Bohao Li, Yuying Ge, Yi Chen, Yixiao Ge, Ruimao Zhang, Ying Shan
cs.AI

Abstract

Comprehending text-rich visual content is paramount for the practical application of Multimodal Large Language Models (MLLMs), since text-rich scenarios, characterized by extensive text embedded within images, are ubiquitous in the real world. Recently, the advent of MLLMs with impressive versatility has raised the bar for what we can expect from them. However, their proficiency in text-rich scenarios has yet to be comprehensively and objectively assessed, since current MLLM benchmarks primarily focus on evaluating general visual comprehension. In this work, we introduce SEED-Bench-2-Plus, a benchmark specifically designed for evaluating the text-rich visual comprehension of MLLMs. Our benchmark comprises 2.3K multiple-choice questions with precise human annotations, spanning three broad categories: Charts, Maps, and Webs, each of which covers a wide spectrum of text-rich scenarios in the real world. These categories, due to their inherent complexity and diversity, effectively simulate real-world text-rich environments. We further conduct a thorough evaluation involving 34 prominent MLLMs (including GPT-4V, Gemini-Pro-Vision, and Claude-3-Opus) and highlight the current limitations of MLLMs in text-rich visual comprehension. We hope that our work can serve as a valuable addition to existing MLLM benchmarks, providing insightful observations and inspiring further research in the area of text-rich visual comprehension with MLLMs. The dataset and evaluation code can be accessed at https://github.com/AILab-CVC/SEED-Bench.
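As a rough illustration of how a multiple-choice benchmark of this kind is typically scored, the minimal Python sketch below computes overall and per-category accuracy from a JSON file of model predictions. The file name (`predictions.json`) and the record fields (`category`, `answer`, `prediction`) are hypothetical placeholders for illustration, not the actual data schema or evaluation API of the SEED-Bench repository.

```python
import json
from collections import defaultdict

def score_mcq(path: str) -> dict:
    """Compute overall and per-category accuracy for multiple-choice
    predictions stored as a JSON list of records.

    Assumed (hypothetical) record schema:
        {"category": "Charts", "answer": "B", "prediction": "B"}
    """
    with open(path, encoding="utf-8") as f:
        records = json.load(f)

    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        cat = r["category"]  # e.g. "Charts", "Maps", or "Webs"
        total[cat] += 1
        # Compare option letters case-insensitively, ignoring stray whitespace.
        if r["prediction"].strip().upper() == r["answer"].strip().upper():
            correct[cat] += 1

    acc = {cat: correct[cat] / total[cat] for cat in total}
    acc["overall"] = sum(correct.values()) / sum(total.values())
    return acc

if __name__ == "__main__":
    print(score_mcq("predictions.json"))
```

Reporting accuracy per category as well as overall, as sketched here, matches how the paper breaks down results across Charts, Maps, and Webs rather than collapsing them into a single score.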
