Visual Haystacks: Answering Harder Questions About Sets of Images
July 18, 2024
Authors: Tsung-Han Wu, Giscard Biamby, Jerome Quenum, Ritwik Gupta, Joseph E. Gonzalez, Trevor Darrell, David M. Chan
cs.AI
Abstract
Recent advances in Large Multimodal Models (LMMs) have driven significant
progress in single-image visual question answering. However, these
models face substantial challenges when tasked with queries that span extensive
collections of images, similar to real-world scenarios like searching through
large photo albums, finding specific information across the internet, or
monitoring environmental changes through satellite imagery. This paper explores
the task of Multi-Image Visual Question Answering (MIQA): given a large set of
images and a natural language query, the task is to generate a relevant and
grounded response. We propose a new public benchmark, dubbed "Visual Haystacks
(VHs)," specifically designed to evaluate LMMs' capabilities in visual
retrieval and reasoning over sets of unrelated images. Our comprehensive
evaluations on it demonstrate that even robust closed-source models
struggle significantly. To address these shortcomings, we introduce
MIRAGE (Multi-Image Retrieval Augmented Generation), a novel retrieval/QA
framework tailored for LMMs that confronts the challenges of MIQA with marked
efficiency and accuracy improvements over baseline methods. Our evaluation
shows that MIRAGE surpasses closed-source GPT-4o models by up to 11% on the VHs
benchmark and offers up to 3.4x improvements in efficiency over text-focused
multi-stage approaches.