Multi-Object Hallucination in Vision-Language Models
July 8, 2024
Authors: Xuweiyi Chen, Ziqiao Ma, Xuejun Zhang, Sihan Xu, Shengyi Qian, Jianing Yang, David F. Fouhey, Joyce Chai
cs.AI
Abstract
Large vision language models (LVLMs) often suffer from object hallucination,
producing objects not present in the given images. While current benchmarks for
object hallucination primarily concentrate on the presence of a single object
class rather than individual entities, this work systematically investigates
multi-object hallucination, examining how models misperceive (e.g., invent
nonexistent objects or become distracted) when tasked with focusing on multiple
objects simultaneously. We introduce Recognition-based Object Probing
Evaluation (ROPE), an automated evaluation protocol that considers the
distribution of object classes within a single image during testing and uses
visual referring prompts to eliminate ambiguity. With comprehensive empirical
studies and analysis of potential factors leading to multi-object
hallucination, we found that (1) LVLMs suffer more hallucinations when focusing
on multiple objects compared to a single object. (2) The tested object class
distribution affects hallucination behaviors, indicating that LVLMs may follow
shortcuts and spurious correlations. (3) Hallucinatory behaviors are influenced
by data-specific factors, salience and frequency, and model intrinsic
behaviors. We hope to enable LVLMs to recognize and reason about multiple
objects that often occur in realistic visual scenes, provide insights, and
quantify our progress towards mitigating the issues.
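To make the evaluation idea concrete, here is a minimal, hypothetical sketch of a ROPE-style scoring loop. It is not the paper's implementation: the function names, the per-image data format, and the exact distribution labels (homogeneous / heterogeneous / mixed) are illustrative assumptions. It shows the core idea the abstract describes, namely scoring multi-object recognition accuracy separately by the distribution of probed object classes within each image.

```python
# Hypothetical sketch of a ROPE-style multi-object probing evaluation.
# For each image we have the ground-truth class of every probed object
# and the model's predicted class for that object (obtained elsewhere,
# e.g., via visual referring prompts). Accuracy is aggregated by the
# in-image class distribution, since the paper reports that this
# distribution affects hallucination behavior.

from collections import defaultdict

def distribution_type(gt_classes):
    """Label the in-image distribution of probed object classes."""
    unique = len(set(gt_classes))
    if unique == 1:
        return "homogeneous"    # all probed objects share one class
    if unique == len(gt_classes):
        return "heterogeneous"  # every probed object has a distinct class
    return "mixed"

def score_by_distribution(examples):
    """examples: list of (gt_classes, predicted_classes) pairs, one per image."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for gt, pred in examples:
        kind = distribution_type(gt)
        for g, p in zip(gt, pred):
            total[kind] += 1
            correct[kind] += (g == p)
    # Per-distribution recognition accuracy; 1 - accuracy is a crude
    # proxy for the hallucination rate on that subset.
    return {kind: correct[kind] / total[kind] for kind in total}

# Toy illustration with made-up class labels.
examples = [
    (["cat", "cat", "cat"], ["cat", "cat", "dog"]),      # homogeneous
    (["cat", "dog", "bird"], ["cat", "dog", "bird"]),    # heterogeneous
    (["cup", "cup", "fork"], ["cup", "plate", "fork"]),  # mixed
]
print(score_by_distribution(examples))
```

A full protocol would also control how objects are referred to (the visual referring prompts mentioned above) and probe multiple objects in a single query; this sketch only covers the per-distribution bookkeeping.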