

Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning

June 18, 2024
Authors: Bingchen Zhao, Yongshuo Zong, Letian Zhang, Timothy Hospedales
cs.AI

Abstract

The advancement of large language models (LLMs) has significantly broadened the scope of applications in natural language processing, with multi-modal LLMs extending these capabilities to integrate and interpret visual data. However, existing benchmarks for visual language models (VLMs) predominantly focus on single-image inputs, neglecting the crucial aspect of multi-image understanding. In this paper, we introduce the Multi-Image Relational Benchmark (MIRB), designed to evaluate VLMs' ability to compare, analyze, and reason across multiple images. Our benchmark encompasses four categories: perception, visual world knowledge, reasoning, and multi-hop reasoning. Through a comprehensive evaluation of a wide range of open-source and closed-source models, we demonstrate that while open-source VLMs have been shown to approach the performance of GPT-4V on single-image tasks, a significant performance gap remains on multi-image reasoning tasks. Our findings also reveal that even the state-of-the-art GPT-4V model struggles with our benchmark, underscoring the need for further research and development in this area. We believe MIRB can serve as a testbed for developing next-generation multi-modal models.
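
Below is a minimal sketch of how one might score a VLM on a multi-image benchmark of this kind and report per-category accuracy (perception, visual world knowledge, reasoning, multi-hop reasoning). The record format, field names, and the `query_vlm` stub are hypothetical illustrations for this sketch, not MIRB's actual data schema or evaluation code.

```python
# Hypothetical evaluation harness for a multi-image benchmark.
# Assumes each record looks like:
# {"category": "perception", "images": ["a.jpg", "b.jpg"],
#  "question": "...", "answer": "..."}
import json
from collections import defaultdict
from typing import List


def query_vlm(image_paths: List[str], question: str) -> str:
    """Placeholder for a call to a vision-language model that takes
    several images plus one question and returns a free-form answer."""
    raise NotImplementedError("Wire this to the VLM under evaluation.")


def evaluate(benchmark_file: str) -> dict:
    """Return accuracy per task category over the benchmark records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    with open(benchmark_file) as f:
        records = json.load(f)
    for rec in records:
        pred = query_vlm(rec["images"], rec["question"])
        total[rec["category"]] += 1
        # Simple exact-match scoring; real benchmarks typically normalize
        # or parse answers (e.g. multiple-choice letters) before comparing.
        if pred.strip().lower() == rec["answer"].strip().lower():
            correct[rec["category"]] += 1
    return {cat: correct[cat] / total[cat] for cat in total}
```

Reporting accuracy separately for each category, rather than a single aggregate score, is what lets a benchmark like this expose gaps that single-image evaluations miss, such as strong perception but weak multi-hop reasoning.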
