Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning
June 18, 2024
Authors: Bingchen Zhao, Yongshuo Zong, Letian Zhang, Timothy Hospedales
cs.AI
Abstract
The advancement of large language models (LLMs) has significantly broadened
the scope of applications in natural language processing, with multi-modal LLMs
extending these capabilities to integrate and interpret visual data. However,
existing benchmarks for visual language models (VLMs) predominantly focus on
single-image inputs, neglecting the crucial aspect of multi-image
understanding. In this paper, we introduce a Multi-Image Relational Benchmark
(MIRB), designed to evaluate VLMs' ability to compare, analyze, and reason across
multiple images. Our benchmark encompasses four categories: perception, visual
world knowledge, reasoning, and multi-hop reasoning. Through a comprehensive
evaluation of a wide range of open-source and closed-source models, we
demonstrate that while open-source VLMs have been shown to approach the performance
of GPT-4V in single-image tasks, a significant performance gap remains in
multi-image reasoning tasks. Our findings also reveal that even the
state-of-the-art GPT-4V model struggles with our benchmark, underscoring the
need for further research and development in this area. We believe our
contribution of MIRB could serve as a testbed for developing the
next-generation multi-modal models.
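As a rough illustration of the evaluation protocol the abstract describes (per-category accuracy over questions that each reference several images), the sketch below shows one way such a loop might look. The file layout, the field names (image_paths, question, answer), and the model.generate interface are assumptions made for illustration, not the authors' released code or data format.

```python
# Minimal sketch of a multi-image benchmark evaluation loop in the spirit of MIRB.
# Dataset layout, record fields, and the VLM interface are hypothetical.
import json
from pathlib import Path
from PIL import Image

# The four categories named in the abstract.
CATEGORIES = ["perception", "visual_world_knowledge", "reasoning", "multi_hop_reasoning"]

def evaluate(model, data_root: Path) -> dict[str, float]:
    """Return per-category accuracy for a VLM that takes a list of images plus a text prompt."""
    scores = {}
    for category in CATEGORIES:
        records = json.loads((data_root / f"{category}.json").read_text())
        correct = 0
        for rec in records:
            # Each question is paired with multiple images, not a single one.
            images = [Image.open(data_root / p) for p in rec["image_paths"]]
            # Hypothetical interface: the model consumes all images and the question at once.
            prediction = model.generate(images=images, prompt=rec["question"])
            correct += int(prediction.strip().lower() == rec["answer"].strip().lower())
        scores[category] = correct / len(records)
    return scores
```

The key design point this sketch highlights is that every example feeds the model several images jointly, so models tuned only for single-image prompts must still compare and relate content across inputs.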