MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models
August 5, 2024
Authors: Fanqing Meng, Jin Wang, Chuanhao Li, Quanfeng Lu, Hao Tian, Jiaqi Liao, Xizhou Zhu, Jifeng Dai, Yu Qiao, Ping Luo, Kaipeng Zhang, Wenqi Shao
cs.AI
Abstract
The capability to process multiple images is crucial for Large
Vision-Language Models (LVLMs) to develop a more thorough and nuanced
understanding of a scene. Recent multi-image LVLMs have begun to address this
need. However, their evaluation has not kept pace with their development. To
fill this gap, we introduce the Multimodal Multi-image Understanding (MMIU)
benchmark, a comprehensive evaluation suite designed to assess LVLMs across a
wide range of multi-image tasks. MMIU encompasses 7 types of multi-image
relationships, 52 tasks, 77K images, and 11K meticulously curated
multiple-choice questions, making it the most extensive benchmark of its kind.
Our evaluation of 24 popular LVLMs, including both open-source and proprietary
models, reveals significant challenges in multi-image comprehension,
particularly in tasks involving spatial understanding. Even the most advanced
models, such as GPT-4o, achieve only 55.7% accuracy on MMIU. Through
multi-faceted analytical experiments, we identify key performance gaps and
limitations, providing valuable insights for future model and data
improvements. We aim for MMIU to advance the frontier of LVLM research and
development, moving us toward achieving sophisticated multimodal multi-image
user interactions.
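
The abstract reports accuracy over 11K multiple-choice questions but does not describe the scoring pipeline. As a rough illustration only, the sketch below shows how multiple-choice accuracy on a benchmark like this is typically computed; the item schema (`id`, `answer` fields) and the `multiple_choice_accuracy` helper are hypothetical assumptions, not MMIU's actual data format or API.

```python
# Minimal sketch of scoring a multiple-choice benchmark such as MMIU.
# The field names ("id", "answer") are illustrative assumptions;
# MMIU's real schema may differ.

def multiple_choice_accuracy(predictions: dict, items: list) -> float:
    """Fraction of items where the predicted option letter matches the key.

    predictions: maps item id -> model's chosen letter, e.g. {"q1": "B"}
    items: list of dicts, each with an "id" and a ground-truth "answer".
    """
    correct = sum(predictions.get(it["id"]) == it["answer"] for it in items)
    return correct / len(items)

if __name__ == "__main__":
    # Tiny synthetic example: 2 of 3 answers correct -> accuracy 0.667.
    items = [
        {"id": "q1", "answer": "A"},
        {"id": "q2", "answer": "C"},
        {"id": "q3", "answer": "B"},
    ]
    preds = {"q1": "A", "q2": "C", "q3": "D"}
    print(f"accuracy = {multiple_choice_accuracy(preds, items):.3f}")
```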