Multi-Object Hallucination in Vision-Language Models
July 8, 2024
Authors: Xuweiyi Chen, Ziqiao Ma, Xuejun Zhang, Sihan Xu, Shengyi Qian, Jianing Yang, David F. Fouhey, Joyce Chai
cs.AI
Abstract
Large vision language models (LVLMs) often suffer from object hallucination,
producing objects not present in the given images. While current benchmarks for
object hallucination primarily concentrate on the presence of a single object
class rather than individual entities, this work systematically investigates
multi-object hallucination, examining how models misperceive (e.g., invent
nonexistent objects or become distracted) when tasked with focusing on multiple
objects simultaneously. We introduce Recognition-based Object Probing
Evaluation (ROPE), an automated evaluation protocol that considers the
distribution of object classes within a single image during testing and uses
visual referring prompts to eliminate ambiguity. With comprehensive empirical
studies and analysis of potential factors leading to multi-object
hallucination, we found that (1) LVLMs suffer more hallucinations when focusing
on multiple objects compared to a single object. (2) The tested object class
distribution affects hallucination behaviors, indicating that LVLMs may follow
shortcuts and spurious correlations. (3) Hallucinatory behaviors are influenced
by data-specific factors, salience and frequency, and model intrinsic
behaviors. We hope to enable LVLMs to recognize and reason about multiple
objects that often occur in realistic visual scenes, provide insights, and
quantify our progress towards mitigating the issues.
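To make the evaluation idea concrete, here is a minimal, hypothetical sketch of a ROPE-style scoring loop. It is not the paper's implementation: the function names, the per-image data format, and the exact distribution labels (homogeneous / heterogeneous / mixed) are illustrative assumptions. It shows the core idea the abstract describes, namely scoring multi-object recognition accuracy separately by the distribution of probed object classes within each image.

```python
# Hypothetical sketch of a ROPE-style multi-object probing evaluation.
# For each image we have the ground-truth class of every probed object
# and the model's predicted class for that object (obtained elsewhere,
# e.g., via visual referring prompts). Accuracy is aggregated by the
# in-image class distribution, since the paper reports that this
# distribution affects hallucination behavior.

from collections import defaultdict

def distribution_type(gt_classes):
    """Label the in-image distribution of probed object classes."""
    unique = len(set(gt_classes))
    if unique == 1:
        return "homogeneous"    # all probed objects share one class
    if unique == len(gt_classes):
        return "heterogeneous"  # every probed object has a distinct class
    return "mixed"

def score_by_distribution(examples):
    """examples: list of (gt_classes, predicted_classes) pairs, one per image."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for gt, pred in examples:
        kind = distribution_type(gt)
        for g, p in zip(gt, pred):
            total[kind] += 1
            correct[kind] += (g == p)
    # Per-distribution recognition accuracy; 1 - accuracy is a crude
    # proxy for the hallucination rate on that subset.
    return {kind: correct[kind] / total[kind] for kind in total}

# Toy illustration with made-up class labels.
examples = [
    (["cat", "cat", "cat"], ["cat", "cat", "dog"]),      # homogeneous
    (["cat", "dog", "bird"], ["cat", "dog", "bird"]),    # heterogeneous
    (["cup", "cup", "fork"], ["cup", "plate", "fork"]),  # mixed
]
print(score_by_distribution(examples))
```

A full protocol would also control how objects are referred to (the visual referring prompts mentioned above) and probe multiple objects in a single query; this sketch only covers the per-distribution bookkeeping.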