HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models
October 23, 2023
Authors: Fuxiao Liu, Tianrui Guan, Zongxia Li, Lichang Chen, Yaser Yacoob, Dinesh Manocha, Tianyi Zhou
cs.AI
Abstract
Large language models (LLMs), after being aligned with vision models and integrated into vision-language models (VLMs), can bring impressive improvements in image reasoning tasks, as demonstrated by the recently released GPT-4V(ision), LLaVA-1.5, and others. However, the strong language prior in these SOTA LVLMs can be a double-edged sword: the models may ignore the image context and rely solely on a (possibly contradictory) language prior for reasoning. In contrast, the vision modules in VLMs are weaker than the LLMs and may produce misleading visual representations, which the LLMs then translate into confident mistakes. To study these two types of VLM mistakes, i.e., language hallucination and visual illusion, we curated HallusionBench, an image-context reasoning benchmark that remains challenging even for GPT-4V and LLaVA-1.5. We provide a detailed analysis of examples in HallusionBench, which offers novel insights into the illusion and hallucination of VLMs and how to improve them in the future. The benchmark and codebase will be released at https://github.com/tianyi-lab/HallusionBench.
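
To make the two failure modes concrete, here is a minimal, hypothetical sketch of how one might probe them with any VLM under test. It is not the official HallusionBench evaluation code; the `query_vlm` helper and the image, question, and answer arguments are placeholders for whichever model client and image pair you plug in. The idea follows the abstract: pair an original image with an edited counterpart whose visual content contradicts common knowledge, and check whether the model's answer tracks the image or the language prior.

```python
# Minimal sketch, not the official HallusionBench harness.
# Assumes a hypothetical query_vlm(image_path, question) wrapper around the
# VLM under test (e.g., a GPT-4V API call or a local LLaVA-1.5 checkpoint).

def query_vlm(image_path: str, question: str) -> str:
    """Placeholder: send one image plus a question to the model, return its answer."""
    raise NotImplementedError("Wrap your own VLM client here.")


def probe_failure_modes(original_img: str, edited_img: str,
                        question: str, prior_answer: str, edited_answer: str) -> dict:
    """Ask the same question about an original image and an edited version whose
    visual content contradicts common knowledge (prior_answer).

    - A model that still returns prior_answer on the edited image is ignoring the
      image context and answering from its language prior (language hallucination).
    - A model that returns neither prior_answer nor the edited image's ground truth
      (edited_answer) may be misreading the visual input (visual illusion).
    """
    ans_orig = query_vlm(original_img, question)
    ans_edit = query_vlm(edited_img, question)

    def same(a: str, b: str) -> bool:
        # Crude string match; a real harness would use a more robust answer parser.
        return a.strip().lower() == b.strip().lower()

    return {
        "answer_on_original": ans_orig,
        "answer_on_edited": ans_edit,
        "language_hallucination_suspected": same(ans_edit, prior_answer),
        "visual_illusion_suspected": (not same(ans_edit, edited_answer)
                                      and not same(ans_edit, prior_answer)),
    }
```

The original/edited pairing is the key design choice: only when the visual evidence and the language prior disagree can the two error sources be told apart.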