MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models
August 5, 2024
Authors: Fanqing Meng, Jin Wang, Chuanhao Li, Quanfeng Lu, Hao Tian, Jiaqi Liao, Xizhou Zhu, Jifeng Dai, Yu Qiao, Ping Luo, Kaipeng Zhang, Wenqi Shao
cs.AI
Abstract
The capability to process multiple images is crucial for Large
Vision-Language Models (LVLMs) to develop a more thorough and nuanced
understanding of a scene. Recent multi-image LVLMs have begun to address this
need. However, their evaluation has not kept pace with their development. To
fill this gap, we introduce the Multimodal Multi-image Understanding (MMIU)
benchmark, a comprehensive evaluation suite designed to assess LVLMs across a
wide range of multi-image tasks. MMIU encompasses 7 types of multi-image
relationships, 52 tasks, 77K images, and 11K meticulously curated
multiple-choice questions, making it the most extensive benchmark of its kind.
Our evaluation of 24 popular LVLMs, including both open-source and proprietary
models, reveals significant challenges in multi-image comprehension,
particularly in tasks involving spatial understanding. Even the most advanced
models, such as GPT-4o, achieve only 55.7% accuracy on MMIU. Through
multi-faceted analytical experiments, we identify key performance gaps and
limitations, providing valuable insights for future model and data
improvements. We aim for MMIU to advance the frontier of LVLM research and
development, moving us toward achieving sophisticated multimodal multi-image
user interactions.
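
The abstract reports accuracy over 11K multiple-choice questions but does not describe the scoring pipeline. As a rough illustration only, the sketch below shows how multiple-choice accuracy on a benchmark like this is typically computed; the item schema (`id`, `answer` fields) and the `multiple_choice_accuracy` helper are hypothetical assumptions, not MMIU's actual data format or API.

```python
# Minimal sketch of scoring a multiple-choice benchmark such as MMIU.
# The field names ("id", "answer") are illustrative assumptions;
# MMIU's real schema may differ.

def multiple_choice_accuracy(predictions: dict, items: list) -> float:
    """Fraction of items where the predicted option letter matches the key.

    predictions: maps item id -> model's chosen letter, e.g. {"q1": "B"}
    items: list of dicts, each with an "id" and a ground-truth "answer".
    """
    correct = sum(predictions.get(it["id"]) == it["answer"] for it in items)
    return correct / len(items)

if __name__ == "__main__":
    # Tiny synthetic example: 2 of 3 answers correct -> accuracy 0.667.
    items = [
        {"id": "q1", "answer": "A"},
        {"id": "q2", "answer": "C"},
        {"id": "q3", "answer": "B"},
    ]
    preds = {"q1": "A", "q2": "C", "q3": "D"}
    print(f"accuracy = {multiple_choice_accuracy(preds, items):.3f}")
```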