MIBench: Evaluating Multimodal Large Language Models over Multiple Images
July 21, 2024
Authors: Haowei Liu, Xi Zhang, Haiyang Xu, Yaya Shi, Chaoya Jiang, Ming Yan, Ji Zhang, Fei Huang, Chunfeng Yuan, Bing Li, Weiming Hu
cs.AI
Abstract
Built on the power of LLMs, numerous multimodal large language models (MLLMs) have recently achieved remarkable performance on various vision-language tasks across multiple benchmarks. However, most existing MLLMs and benchmarks focus primarily on single-image input scenarios, leaving the performance of MLLMs on realistic multi-image inputs underexplored. Although a few benchmarks do consider multiple images, their evaluation dimensions and samples are very limited. Therefore, in this paper we propose a new benchmark, MIBench, to comprehensively evaluate the fine-grained abilities of MLLMs in multi-image scenarios. Specifically, MIBench categorizes multi-image abilities into three scenarios: multi-image instruction (MII), multimodal knowledge-seeking (MKS), and multimodal in-context learning (MIC), and constructs 13 tasks with a total of 13K annotated samples. During data construction, for MII and MKS we extract correct options from manual annotations and create challenging distractors to obtain multiple-choice questions. For MIC, to enable an in-depth evaluation, we define four sub-tasks and transform the original datasets into in-context learning formats. We evaluate several open-source and closed-source MLLMs on the proposed MIBench. The results reveal that although current models excel at single-image tasks, they exhibit significant shortcomings when faced with multi-image inputs, such as confused fine-grained perception, limited multi-image reasoning, and unstable in-context learning. The annotated data of MIBench is available at https://huggingface.co/datasets/StarBottle/MIBench.
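Since the annotated data is hosted on the Hugging Face Hub, it can be loaded with the `datasets` library. Below is a minimal sketch that builds a lettered multiple-choice prompt from one sample; only the repository id comes from the paper, while the split name and the field names (`question`, `options`) are illustrative assumptions, not confirmed by the abstract.

```python
# Minimal sketch: load MIBench and format one multiple-choice item.
# Assumptions: a "test" split and "question"/"options" fields exist
# (hypothetical names for illustration; check the dataset card).
from datasets import load_dataset

ds = load_dataset("StarBottle/MIBench", split="test")  # repo id from the paper

sample = ds[0]
# MII/MKS items are multiple-choice: one question, several lettered options,
# and multiple associated images that the model must reason over jointly.
prompt = sample["question"] + "\n" + "\n".join(
    f"({chr(65 + i)}) {opt}" for i, opt in enumerate(sample["options"])
)
print(prompt)
```

Evaluation then reduces to checking whether the model's chosen letter matches the annotated answer, which is what makes the distractor quality described above central to the benchmark's difficulty.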