

MIBench: Evaluating Multimodal Large Language Models over Multiple Images

July 21, 2024
Authors: Haowei Liu, Xi Zhang, Haiyang Xu, Yaya Shi, Chaoya Jiang, Ming Yan, Ji Zhang, Fei Huang, Chunfeng Yuan, Bing Li, Weiming Hu
cs.AI

Abstract

Built on the power of LLMs, numerous multimodal large language models (MLLMs) have recently achieved remarkable performance on various vision-language tasks across multiple benchmarks. However, most existing MLLMs and benchmarks focus primarily on single-image inputs, leaving the performance of MLLMs on realistic multi-image inputs largely underexplored. Although a few benchmarks consider multiple images, their evaluation dimensions and samples are very limited. Therefore, in this paper we propose a new benchmark, MIBench, to comprehensively evaluate the fine-grained abilities of MLLMs in multi-image scenarios. Specifically, MIBench categorizes multi-image abilities into three scenarios: multi-image instruction (MII), multimodal knowledge-seeking (MKS), and multimodal in-context learning (MIC), and constructs 13 tasks with a total of 13K annotated samples. During data construction, for MII and MKS we extract correct options from manual annotations and create challenging distractors to obtain multiple-choice questions. For MIC, to enable an in-depth evaluation, we set up four sub-tasks and transform the original datasets into in-context learning formats. We evaluate several open-source and closed-source MLLMs on the proposed MIBench. The results reveal that although current models excel at single-image tasks, they exhibit significant shortcomings when faced with multi-image inputs, such as confused fine-grained perception, limited multi-image reasoning, and unstable in-context learning. The annotated data in MIBench is available at https://huggingface.co/datasets/StarBottle/MIBench.
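Since the annotations are hosted on the Hugging Face Hub, they can be fetched with the standard `datasets` library. The snippet below is a minimal sketch, assuming the dataset exposes a default configuration; the abstract does not document the split or column names, so the code only inspects whatever is published rather than assuming a schema.

```python
# Minimal sketch: download and inspect the MIBench annotations.
# Assumes a default configuration exists; split and field names are
# not specified in the abstract, so we print them instead of guessing.
from datasets import load_dataset

mibench = load_dataset("StarBottle/MIBench")

print(mibench)                 # available splits and their sizes/columns
split = next(iter(mibench))    # name of the first available split
print(mibench[split][0])       # one annotated sample as a dict
```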

