MIBench: Evaluating Multimodal Large Language Models over Multiple Images
July 21, 2024
Authors: Haowei Liu, Xi Zhang, Haiyang Xu, Yaya Shi, Chaoya Jiang, Ming Yan, Ji Zhang, Fei Huang, Chunfeng Yuan, Bing Li, Weiming Hu
cs.AI
Abstract
Built on the power of LLMs, numerous multimodal large language models (MLLMs) have recently achieved remarkable performance on various vision-language tasks across multiple benchmarks. However, most existing MLLMs and benchmarks focus primarily on single-image input scenarios, leaving the performance of MLLMs on realistic multi-image inputs underexplored. Although a few benchmarks do consider multiple images, their evaluation dimensions and samples are very limited. Therefore, in this paper we propose a new benchmark, MIBench, to comprehensively evaluate the fine-grained abilities of MLLMs in multi-image scenarios. Specifically, MIBench categorizes multi-image abilities into three scenarios: multi-image instruction (MII), multimodal knowledge-seeking (MKS), and multimodal in-context learning (MIC), and constructs 13 tasks with a total of 13K annotated samples. During data construction, for MII and MKS we extract correct options from manual annotations and create challenging distractors to obtain multiple-choice questions. For MIC, to enable an in-depth evaluation, we define four sub-tasks and transform the original datasets into in-context learning formats. We evaluate several open-source and closed-source MLLMs on the proposed MIBench. The results reveal that although current models excel at single-image tasks, they exhibit significant shortcomings when faced with multi-image inputs, such as confused fine-grained perception, limited multi-image reasoning, and unstable in-context learning. The annotated data of MIBench is available at https://huggingface.co/datasets/StarBottle/MIBench.
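Since the annotated data is hosted on the Hugging Face Hub, it can be loaded with the `datasets` library. Below is a minimal sketch that builds a lettered multiple-choice prompt from one sample; only the repository id comes from the paper, while the split name and the field names (`question`, `options`) are illustrative assumptions, not confirmed by the abstract.

```python
# Minimal sketch: load MIBench and format one multiple-choice item.
# Assumptions: a "test" split and "question"/"options" fields exist
# (hypothetical names for illustration; check the dataset card).
from datasets import load_dataset

ds = load_dataset("StarBottle/MIBench", split="test")  # repo id from the paper

sample = ds[0]
# MII/MKS items are multiple-choice: one question, several lettered options,
# and multiple associated images that the model must reason over jointly.
prompt = sample["question"] + "\n" + "\n".join(
    f"({chr(65 + i)}) {opt}" for i, opt in enumerate(sample["options"])
)
print(prompt)
```

Evaluation then reduces to checking whether the model's chosen letter matches the annotated answer, which is what makes the distractor quality described above central to the benchmark's difficulty.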