
MIBench: Evaluating Multimodal Large Language Models over Multiple Images

July 21, 2024
作者: Haowei Liu, Xi Zhang, Haiyang Xu, Yaya Shi, Chaoya Jiang, Ming Yan, Ji Zhang, Fei Huang, Chunfeng Yuan, Bing Li, Weiming Hu
cs.AI

Abstract

Built on the power of LLMs, numerous multimodal large language models (MLLMs) have recently achieved remarkable performance on various vision-language tasks across multiple benchmarks. However, most existing MLLMs and benchmarks primarily focus on single-image input scenarios, leaving the performance of MLLMs on realistic multi-image inputs underexplored. Although a few benchmarks consider multiple images, their evaluation dimensions and samples are very limited. Therefore, in this paper, we propose a new benchmark, MIBench, to comprehensively evaluate the fine-grained abilities of MLLMs in multi-image scenarios. Specifically, MIBench categorizes multi-image abilities into three scenarios: multi-image instruction (MII), multimodal knowledge-seeking (MKS), and multimodal in-context learning (MIC), and constructs 13 tasks with a total of 13K annotated samples. During data construction, for MII and MKS we extract correct options from manual annotations and create challenging distractors to obtain multiple-choice questions. For MIC, to enable an in-depth evaluation, we set four sub-tasks and transform the original datasets into in-context learning formats. We evaluate several open-source and closed-source MLLMs on the proposed MIBench. The results reveal that although current models excel at single-image tasks, they exhibit significant shortcomings when faced with multi-image inputs, such as confused fine-grained perception, limited multi-image reasoning, and unstable in-context learning. The annotated data in MIBench is available at https://huggingface.co/datasets/StarBottle/MIBench.
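The multiple-choice construction described for the MII and MKS scenarios (a correct option drawn from manual annotations, paired with challenging distractors) can be sketched roughly as below. The helper and field names here are hypothetical illustrations, not the paper's actual pipeline:

```python
import random

def build_mcq(question, correct, distractors, seed=0):
    """Assemble one multiple-choice sample: shuffle the correct answer
    among the distractors and record the resulting answer letter."""
    rng = random.Random(seed)  # fixed seed for a reproducible option order
    options = [correct] + list(distractors)
    rng.shuffle(options)
    letters = "ABCD"
    return {
        "question": question,
        "options": dict(zip(letters, options)),
        "answer": letters[options.index(correct)],
    }

# Hypothetical multi-image sample in the spirit of the benchmark
sample = build_mcq(
    "Which image shows the same landmark as Image 1?",
    "Image 3",
    ["Image 2", "Image 4", "Image 5"],
)
```

A model is then scored by whether the letter it outputs matches `sample["answer"]`; shuffling prevents the correct option from always landing in the same position.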

