Needle In A Multimodal Haystack
June 11, 2024
Authors: Weiyun Wang, Shuibo Zhang, Yiming Ren, Yuchen Duan, Tiantong Li, Shuo Liu, Mengkang Hu, Zhe Chen, Kaipeng Zhang, Lewei Lu, Xizhou Zhu, Ping Luo, Yu Qiao, Jifeng Dai, Wenqi Shao, Wenhai Wang
cs.AI
Abstract
With the rapid advancement of multimodal large language models (MLLMs), their
evaluation has become increasingly comprehensive. However, understanding long
multimodal content, as a foundational ability for real-world applications,
remains underexplored. In this work, we present Needle In A Multimodal Haystack
(MM-NIAH), the first benchmark specifically designed to systematically evaluate
the capability of existing MLLMs to comprehend long multimodal documents. Our
benchmark includes three types of evaluation tasks: multimodal retrieval,
counting, and reasoning. In each task, the model is required to answer the
questions according to different key information scattered throughout the given
multimodal document. Evaluating the leading MLLMs on MM-NIAH, we observe that
existing models still have significant room for improvement on these tasks,
especially on vision-centric evaluation. We hope this work can provide a
platform for further research on long multimodal document comprehension and
contribute to the advancement of MLLMs. Code and benchmark are released at
https://github.com/OpenGVLab/MM-NIAH.
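The needle-in-a-haystack setup described above — key information scattered throughout a long document, with a question that requires locating all of it — can be sketched as follows. This is a hypothetical illustration of the general evaluation paradigm, not the MM-NIAH generation code; the function and field names are assumptions, and plain text segments stand in for the interleaved image-text content of real multimodal documents.

```python
import random

def build_niah_sample(haystack_segments, needles, question, answer):
    """Build one needle-in-a-haystack evaluation sample.

    Hypothetical sketch: 'needle' facts are inserted at distinct random
    positions in a long document, and the sample pairs the document with
    a question whose answer depends on finding every needle (mirroring
    the retrieval/counting/reasoning tasks described in the abstract).
    """
    doc = list(haystack_segments)
    # Pick distinct insertion points so the needles end up scattered,
    # then insert from the back so earlier indices stay valid.
    positions = sorted(random.sample(range(len(doc) + 1), len(needles)),
                       reverse=True)
    for pos, needle in zip(positions, needles):
        doc.insert(pos, needle)
    return {"document": doc, "question": question, "answer": answer}

sample = build_niah_sample(
    haystack_segments=[f"filler paragraph {i}" for i in range(20)],
    needles=["The magic number for city A is 7.",
             "The magic number for city B is 3."],
    question="What is the sum of the magic numbers for city A and city B?",
    answer="10",
)
```

A model under evaluation would receive `sample["document"]` and `sample["question"]`, and its response would be scored against `sample["answer"]`; the reasoning task additionally requires combining multiple needles, as in the sum above.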