Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning
June 18, 2024
Authors: Bingchen Zhao, Yongshuo Zong, Letian Zhang, Timothy Hospedales
cs.AI
Abstract
The advancement of large language models (LLMs) has significantly broadened
the scope of applications in natural language processing, with multi-modal LLMs
extending these capabilities to integrate and interpret visual data. However,
existing benchmarks for visual language models (VLMs) predominantly focus on
single-image inputs, neglecting the crucial aspect of multi-image
understanding. In this paper, we introduce the Multi-Image Relational Benchmark
(MIRB), designed to evaluate VLMs' ability to compare, analyze, and reason across
multiple images. Our benchmark encompasses four categories: perception, visual
world knowledge, reasoning, and multi-hop reasoning. Through a comprehensive
evaluation of a wide range of open-source and closed-source models, we
demonstrate that while open-source VLMs approach the performance of GPT-4V in
single-image tasks, a significant performance gap remains in
multi-image reasoning tasks. Our findings also reveal that even the
state-of-the-art GPT-4V model struggles with our benchmark, underscoring the
need for further research and development in this area. We believe MIRB can
serve as a testbed for developing the next generation of multi-modal models.
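For illustration only, the following is a minimal sketch of how one might score a VLM on multi-image questions of the kind described in the abstract. It is not the authors' evaluation harness: the MultiImageExample fields, the query_vlm placeholder, and the exact-match-per-category scoring rule are all assumptions introduced here, and query_vlm would need to be replaced with a real model or API call.

from dataclasses import dataclass
from typing import Dict, List

@dataclass
class MultiImageExample:
    """One multi-image item: several images, a question, a gold answer, and a task category."""
    image_paths: List[str]
    question: str
    answer: str
    category: str  # e.g. "perception", "knowledge", "reasoning", "multi-hop"

def query_vlm(image_paths: List[str], question: str) -> str:
    """Hypothetical stand-in for a real VLM call (e.g. GPT-4V or an open-source model).
    Replace with an actual client; here it simply returns an empty prediction."""
    return ""

def accuracy_by_category(examples: List[MultiImageExample]) -> Dict[str, float]:
    """Exact-match accuracy per category, the simplest plausible scoring rule."""
    correct: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for ex in examples:
        prediction = query_vlm(ex.image_paths, ex.question).strip().lower()
        total[ex.category] = total.get(ex.category, 0) + 1
        if prediction == ex.answer.strip().lower():
            correct[ex.category] = correct.get(ex.category, 0) + 1
    return {cat: correct.get(cat, 0) / n for cat, n in total.items()}

if __name__ == "__main__":
    demo = [
        MultiImageExample(["img_a.jpg", "img_b.jpg"],
                          "Which image shows more objects?", "image 1", "perception"),
    ]
    print(accuracy_by_category(demo))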