DASH: Detection and Assessment of Systematic Hallucinations of VLMs
March 30, 2025
Authors: Maximilian Augustin, Yannic Neuhaus, Matthias Hein
cs.AI
Abstract
Vision-language models (VLMs) are prone to object hallucinations, where they
erroneously indicate the presence of certain objects in an image. Existing
benchmarks quantify hallucinations using relatively small, labeled datasets.
However, this approach is i) insufficient to assess hallucinations that arise
in open-world settings, where VLMs are widely used, and ii) inadequate for
detecting systematic errors in VLMs. We propose DASH (Detection and Assessment
of Systematic Hallucinations), an automatic, large-scale pipeline designed to
identify systematic hallucinations of VLMs on real-world images in an
open-world setting. A key component is DASH-OPT for image-based retrieval,
where we optimize over the "natural image manifold" to generate images that
mislead the VLM. The output of DASH consists of clusters of real and
semantically similar images for which the VLM hallucinates an object. We apply
DASH to PaliGemma and two LLaVA-NeXT models across 380 object classes and, in
total, find more than 19k clusters with 950k images. We study the transfer of
the identified systematic hallucinations to other VLMs and show that
fine-tuning PaliGemma with the model-specific images obtained with DASH
mitigates object hallucinations. Code and data are available at
https://YanNeu.github.io/DASH.
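To make the detection step concrete, the following minimal sketch (an illustration inferred from the abstract, not the authors' released pipeline) queries a VLM with a yes/no object-presence question; a "yes" answer on an image that verifiably does not contain the object counts as an object hallucination. The checkpoint name, prompt wording, and the helper `hallucinates` are assumptions chosen for the example, using the public PaliGemma interface in Hugging Face transformers.

```python
# Minimal sketch of an object-presence hallucination check, inferred from the
# abstract (not the authors' released code). A "yes" answer on an image known
# NOT to contain the object is counted as a hallucination.
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

MODEL_ID = "google/paligemma-3b-mix-224"  # public checkpoint; an assumption
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = PaliGemmaForConditionalGeneration.from_pretrained(MODEL_ID).eval()

def hallucinates(image: Image.Image, object_name: str) -> bool:
    """Return True if the VLM claims `object_name` is present in `image`."""
    # PaliGemma uses the "answer en" prefix for English VQA prompts.
    prompt = f"answer en Is there a {object_name} in the image?"
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=5, do_sample=False)
    # Decode only the tokens generated after the prompt.
    answer = processor.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    return answer.strip().lower().startswith("yes")
```

Run at scale over retrieved images that are known negatives for each of the 380 object classes, such a check yields the misclassified images that DASH then groups into clusters of semantically similar examples.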