

Multi-Object Hallucination in Vision-Language Models

July 8, 2024
作者: Xuweiyi Chen, Ziqiao Ma, Xuejun Zhang, Sihan Xu, Shengyi Qian, Jianing Yang, David F. Fouhey, Joyce Chai
cs.AI

Abstract

Large vision language models (LVLMs) often suffer from object hallucination, producing objects not present in the given images. While current benchmarks for object hallucination primarily concentrate on the presence of a single object class rather than individual entities, this work systematically investigates multi-object hallucination, examining how models misperceive (e.g., invent nonexistent objects or become distracted) when tasked with focusing on multiple objects simultaneously. We introduce Recognition-based Object Probing Evaluation (ROPE), an automated evaluation protocol that considers the distribution of object classes within a single image during testing and uses visual referring prompts to eliminate ambiguity. With comprehensive empirical studies and analysis of potential factors leading to multi-object hallucination, we find that (1) LVLMs suffer more hallucinations when focusing on multiple objects compared to a single object; (2) the tested object class distribution affects hallucination behaviors, indicating that LVLMs may follow shortcuts and spurious correlations; and (3) hallucinatory behaviors are influenced by data-specific factors, salience and frequency, and model intrinsic behaviors. We hope to enable LVLMs to recognize and reason about multiple objects that often occur in realistic visual scenes, provide insights, and quantify our progress towards mitigating the issues.
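The abstract describes ROPE only at a high level: it conditions on the within-image object-class distribution and probes several referred objects at once. The sketch below illustrates what such a protocol could look like; the 3-way distribution buckets, the `<obj*>` referring-prompt format, and all function names are hypothetical assumptions, not the paper's actual implementation.

```python
from collections import Counter

def categorize_distribution(labels):
    """Bucket an image by the class distribution of its probed objects.

    Hypothetical 3-way scheme: the idea is that hallucination rates can be
    compared across distribution types, as ROPE's protocol requires.
    """
    counts = Counter(labels)
    if len(counts) == 1:
        return "homogeneous"    # all probed objects share one class
    if all(c == 1 for c in counts.values()):
        return "heterogeneous"  # every probed object has a distinct class
    return "mixed"

def build_multi_object_prompt(n):
    """A visual-referring prompt asking about n marked objects at once.

    The <objN> markers stand in for visual references (e.g., drawn boxes)
    that disambiguate which entities the question is about.
    """
    refs = ", ".join(f"<obj{i + 1}>" for i in range(n))
    return (f"In the image, objects {refs} are each marked with a box. "
            f"List the class of each marked object in order.")

def hallucination_rate(predictions, ground_truth):
    """Fraction of probed objects whose predicted class does not match."""
    wrong = sum(p != g for p, g in zip(predictions, ground_truth))
    return wrong / len(ground_truth)
```

A harness would group images by `categorize_distribution`, query the model with `build_multi_object_prompt`, and report `hallucination_rate` per bucket, which is how a distribution-dependent effect like finding (2) could be measured.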