MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models

August 5, 2024
Authors: Fanqing Meng, Jin Wang, Chuanhao Li, Quanfeng Lu, Hao Tian, Jiaqi Liao, Xizhou Zhu, Jifeng Dai, Yu Qiao, Ping Luo, Kaipeng Zhang, Wenqi Shao
cs.AI

Abstract

The capability to process multiple images is crucial for Large Vision-Language Models (LVLMs) to develop a more thorough and nuanced understanding of a scene. Recent multi-image LVLMs have begun to address this need. However, their evaluation has not kept pace with their development. To fill this gap, we introduce the Multimodal Multi-image Understanding (MMIU) benchmark, a comprehensive evaluation suite designed to assess LVLMs across a wide range of multi-image tasks. MMIU encompasses 7 types of multi-image relationships, 52 tasks, 77K images, and 11K meticulously curated multiple-choice questions, making it the most extensive benchmark of its kind. Our evaluation of 24 popular LVLMs, including both open-source and proprietary models, reveals significant challenges in multi-image comprehension, particularly in tasks involving spatial understanding. Even the most advanced models, such as GPT-4o, achieve only 55.7% accuracy on MMIU. Through multi-faceted analytical experiments, we identify key performance gaps and limitations, providing valuable insights for future model and data improvements. We aim for MMIU to advance the frontier of LVLM research and development, moving us toward achieving sophisticated multimodal multi-image user interactions.
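The abstract does not specify MMIU's data schema or evaluation harness, so the following is only an illustrative sketch of how accuracy on a multiple-choice benchmark of this kind is typically computed. The JSON field names (`images`, `question`, `options`, `answer`) and the `model.predict` interface are hypothetical placeholders, not taken from the paper.

```python
import json

def evaluate_multiple_choice(model, samples):
    """Score a model on multiple-choice items; returns overall accuracy.

    Each sample is assumed (hypothetically) to hold a list of image
    paths, a question, lettered options, and a ground-truth letter.
    """
    correct = 0
    for sample in samples:
        # Format the question with its options, one lettered choice per line.
        prompt = sample["question"] + "\n" + "\n".join(
            f"{letter}. {text}" for letter, text in sample["options"].items()
        )
        # Placeholder model interface: takes the images and the prompt,
        # and is expected to return a single option letter such as "B".
        prediction = model.predict(images=sample["images"], prompt=prompt)
        if prediction.strip().upper() == sample["answer"]:
            correct += 1
    return correct / len(samples)

# Hypothetical usage:
# with open("mmiu_samples.json") as f:
#     samples = json.load(f)
# print(f"accuracy = {evaluate_multiple_choice(my_model, samples):.1%}")
```

Under this setup, a model's score is simply the fraction of questions whose predicted option letter matches the ground truth, which is the sense in which GPT-4o's reported 55.7% on MMIU should be read.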
