Needle In A Multimodal Haystack
June 11, 2024
作者: Weiyun Wang, Shuibo Zhang, Yiming Ren, Yuchen Duan, Tiantong Li, Shuo Liu, Mengkang Hu, Zhe Chen, Kaipeng Zhang, Lewei Lu, Xizhou Zhu, Ping Luo, Yu Qiao, Jifeng Dai, Wenqi Shao, Wenhai Wang
cs.AI
Abstract
With the rapid advancement of multimodal large language models (MLLMs), their
evaluation has become increasingly comprehensive. However, understanding long
multimodal content, as a foundational ability for real-world applications,
remains underexplored. In this work, we present Needle In A Multimodal Haystack
(MM-NIAH), the first benchmark specifically designed to systematically evaluate
the capability of existing MLLMs to comprehend long multimodal documents. Our
benchmark includes three types of evaluation tasks: multimodal retrieval,
counting, and reasoning. In each task, the model must answer questions based on
key pieces of information scattered throughout the given multimodal document.
Evaluating the leading MLLMs on MM-NIAH, we observe that
existing models still have significant room for improvement on these tasks,
especially on vision-centric evaluation. We hope this work can provide a
platform for further research on long multimodal document comprehension and
contribute to the advancement of MLLMs. Code and benchmark are released at
https://github.com/OpenGVLab/MM-NIAH.
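To make the task setup concrete, below is a minimal, self-contained Python sketch of how a needle-in-a-haystack multimodal sample might be assembled: a long interleaved document is built from text and image segments, a "needle" fact is inserted at a chosen depth, and a question targets that fact. All names here (Sample, insert_needles, build_retrieval_sample) and the toy question and answer are illustrative assumptions, not the repository's actual API or data format.

```python
# A minimal, hypothetical sketch of the needle-in-a-haystack setup described
# above. The class and function names are illustrative assumptions and do not
# come from the MM-NIAH repository.
import random
from dataclasses import dataclass


@dataclass
class Sample:
    """An interleaved multimodal document plus a question about its needle(s)."""
    segments: list  # mix of {"type": "text", ...} and {"type": "image", ...}
    question: str
    answer: str


def insert_needles(haystack, needles, depths):
    """Scatter needle segments into the haystack at relative depths
    (0.0 = start of document, 1.0 = end), mimicking the 'key information
    scattered throughout the given multimodal document' setup."""
    doc = list(haystack)
    # Insert the deepest needles first so earlier insertions do not shift
    # the positions computed for later ones.
    for needle, depth in sorted(zip(needles, depths), key=lambda p: -p[1]):
        doc.insert(int(depth * len(doc)), needle)
    return doc


def build_retrieval_sample(haystack, fact, depth):
    """Build one retrieval-style sample: hide a textual fact in the document
    and ask the model to recover it."""
    needle = {"type": "text", "text": fact}
    segments = insert_needles(haystack, [needle], [depth])
    return Sample(
        segments=segments,
        question="What is the special magic number mentioned in the document?",
        answer=fact.rstrip(".").split()[-1],  # assumes the fact ends with the number
    )


if __name__ == "__main__":
    # A toy haystack: filler text segments plus one image placeholder.
    haystack = [{"type": "text", "text": f"Filler paragraph {i}."} for i in range(8)]
    haystack.append({"type": "image", "path": "placeholder.jpg"})
    random.shuffle(haystack)

    sample = build_retrieval_sample(
        haystack, fact="The special magic number is 42.", depth=0.5
    )
    print(f"Document has {len(sample.segments)} segments.")
    print(sample.question)
    print("Expected answer:", sample.answer)
```

An evaluation harness would then feed sample.segments and sample.question to an MLLM and score its output against sample.answer; under this reading of the abstract, the counting and reasoning tasks would differ mainly in how needles are placed and how the expected answer is derived from them.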