Visual Haystacks: Answering Harder Questions About Sets of Images
July 18, 2024
Authors: Tsung-Han Wu, Giscard Biamby, Jerome Quenum, Ritwik Gupta, Joseph E. Gonzalez, Trevor Darrell, David M. Chan
cs.AI
Abstract
Recent advances in Large Multimodal Models (LMMs) have brought significant
progress in single-image visual question answering. However, these
models face substantial challenges when tasked with queries that span extensive
collections of images, similar to real-world scenarios like searching through
large photo albums, finding specific information across the internet, or
monitoring environmental changes through satellite imagery. This paper explores
the task of Multi-Image Visual Question Answering (MIQA): given a large set of
images and a natural language query, the task is to generate a relevant and
grounded response. We propose a new public benchmark, dubbed "Visual Haystacks
(VHs)," specifically designed to evaluate LMMs' capabilities in visual
retrieval and reasoning over sets of unrelated images. Our comprehensive
evaluations on it show that even robust closed-source models struggle
significantly. To address these shortcomings, we introduce
MIRAGE (Multi-Image Retrieval Augmented Generation), a novel retrieval/QA
framework tailored for LMMs that confronts the challenges of MIQA with marked
efficiency and accuracy improvements over baseline methods. Our evaluation
shows that MIRAGE surpasses closed-source GPT-4o models by up to 11% on the VHs
benchmark and offers up to 3.4x improvements in efficiency over text-focused
multi-stage approaches.
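
The MIQA setup described in the abstract (a large set of images plus a natural language query) can be illustrated with a generic retrieve-then-answer loop. The sketch below is not the MIRAGE architecture, whose details the abstract does not give. It is a toy illustration under stated assumptions: images are stood in for by sets of object labels, and the names `retrieve_then_answer`, `toy_relevance`, `toy_answerer`, and the `(anchor, target)` query shape are all hypothetical.

```python
from typing import Callable, List, Sequence, Set, Tuple

# Assumption: each "image" is represented by the set of object labels it contains.
Image = Set[str]
# Assumption: a query is (anchor, target), read as
# "in the image that contains <anchor>, is there a <target>?"
Query = Tuple[str, str]


def retrieve_then_answer(
    images: Sequence[Image],
    query: Query,
    relevance: Callable[[Image, Query], float],
    answerer: Callable[[List[Image], Query], str],
    top_k: int = 1,
) -> str:
    """Score every image against the query, keep the top_k, then answer."""
    ranked = sorted(images, key=lambda img: relevance(img, query), reverse=True)
    kept = list(ranked[:top_k])
    return answerer(kept, query)


def toy_relevance(image: Image, query: Query) -> float:
    """Hypothetical retriever: an image is relevant if it contains the anchor."""
    anchor, _ = query
    return 1.0 if anchor in image else 0.0


def toy_answerer(kept: List[Image], query: Query) -> str:
    """Hypothetical reader: answers only from the retrieved images."""
    anchor, target = query
    found = any(anchor in img and target in img for img in kept)
    return "yes" if found else "no"


haystack = [{"cat", "sofa"}, {"dog", "frisbee"}, {"car", "tree"}]
result = retrieve_then_answer(haystack, ("dog", "frisbee"), toy_relevance, toy_answerer)
print(result)
```

In a real system the relevance function and the answerer would be learned models; the point of the sketch is only that filtering the set before answering keeps the answering step's input small as the image collection grows.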