

HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models

October 23, 2023
Authors: Fuxiao Liu, Tianrui Guan, Zongxia Li, Lichang Chen, Yaser Yacoob, Dinesh Manocha, Tianyi Zhou
cs.AI

Abstract

Large language models (LLMs), after being aligned with vision models and integrated into vision-language models (VLMs), can bring impressive improvements in image reasoning tasks. This was shown by the recently released GPT-4V(ision), LLaVA-1.5, etc. However, the strong language prior in these SOTA LVLMs can be a double-edged sword: they may ignore the image context and rely solely on the (even contradictory) language prior for reasoning. In contrast, the vision modules in VLMs are weaker than the LLMs and may produce misleading visual representations, which the LLMs then translate into confident mistakes. To study these two types of VLM mistakes, i.e., language hallucination and visual illusion, we curated HallusionBench, an image-context reasoning benchmark that remains challenging even for GPT-4V and LLaVA-1.5. We provide a detailed analysis of examples in HallusionBench, which offers novel insights into the illusions and hallucinations of VLMs and how to improve them in the future. The benchmark and codebase will be released at https://github.com/tianyi-lab/HallusionBench.
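
To make the kind of image-context evaluation described above concrete, here is a minimal, hypothetical sketch of a yes/no scoring loop over (image, question, answer) triples. The annotation file name, field names, and the `ask_vlm` callable are illustrative assumptions, not HallusionBench's actual data format or evaluation scripts; passing the model as a callable keeps the sketch independent of any particular VLM API.

```python
# Minimal sketch: score a VLM on yes/no image-context questions.
# Assumed (hypothetical) annotation format: a JSON list of
# {"image": <path>, "question": <str>, "answer": "yes"|"no"} entries.
import json
from typing import Callable


def evaluate(ask_vlm: Callable[[str, str], str], annotation_path: str) -> float:
    """Return accuracy of a VLM over the annotated yes/no questions.

    ask_vlm(image_path, question) is expected to return "yes" or "no".
    """
    with open(annotation_path, "r", encoding="utf-8") as f:
        samples = json.load(f)

    correct = 0
    for s in samples:
        prediction = ask_vlm(s["image"], s["question"]).strip().lower()
        correct += prediction == s["answer"].strip().lower()
    return correct / len(samples)


# Usage with a stub model that always answers "yes" (a language-prior-only
# baseline that never looks at the image):
# accuracy = evaluate(lambda img, q: "yes", "hallusionbench_questions.json")
```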