AUTOHALLUSION: Automatic Generation of Hallucination Benchmarks for Vision-Language Models
June 16, 2024
作者: Xiyang Wu, Tianrui Guan, Dianqi Li, Shuaiyi Huang, Xiaoyu Liu, Xijun Wang, Ruiqi Xian, Abhinav Shrivastava, Furong Huang, Jordan Lee Boyd-Graber, Tianyi Zhou, Dinesh Manocha
cs.AI
Abstract
Large vision-language models (LVLMs) hallucinate: certain context cues in an image may trigger the language module's overconfident and incorrect reasoning about abnormal or hypothetical objects. Though a few benchmarks have been developed to investigate LVLM hallucinations, they mainly rely on hand-crafted corner cases whose failure patterns may hardly generalize, and finetuning on them could undermine their validity. These issues motivate us to develop the first automatic benchmark generation approach, AUTOHALLUSION, which harnesses a few principal strategies to create diverse hallucination examples. It probes the language modules in LVLMs for context cues and uses them to synthesize images by: (1) adding objects abnormal to the context cues; (2) for two co-occurring objects, keeping one and excluding the other; or (3) removing objects closely tied to the context cues. It then generates image-based questions whose ground-truth answers contradict the language module's prior. A model has to overcome contextual biases and distractions to reach correct answers, while incorrect or inconsistent answers indicate hallucinations. AUTOHALLUSION enables us to create new benchmarks at minimal cost and thus overcomes the fragility of hand-crafted benchmarks. It also reveals common failure patterns and reasons, providing key insights to detect, avoid, or control hallucinations. Comprehensive evaluations of top-tier LVLMs, e.g., GPT-4V(ision), Gemini Pro Vision, Claude 3, and LLaVA-1.5, show success rates of 97.7% and 98.7% in inducing hallucinations on the synthetic and real-world datasets of AUTOHALLUSION, respectively, paving the way for a long battle against hallucinations.
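To make the probe-edit-ask pipeline described in the abstract concrete, the following is a minimal sketch of the three image-manipulation strategies and the hallucination check. The `LVLM` interface, the `insert_object`/`remove_object` editing helpers, and the question templates are hypothetical placeholders assumed here for illustration, not the authors' released code.

```python
"""Minimal sketch of an AUTOHALLUSION-style probe-edit-ask loop.

Assumptions (hypothetical, not from the paper's codebase): a model object
with `ask(image, question) -> str`, and image-editing helpers
`insert_object(image, name)` / `remove_object(image, name)` that would be
backed by a generative editor or dataset manipulation in practice.
"""

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Probe:
    image: object        # edited image (synthetic or real-world photo)
    question: str        # image-grounded question about the edit
    ground_truth: str    # answer determined by the edit itself


def make_probes(scene: object,
                co_occurring: List[str],
                abnormal_object: str,
                insert_object: Callable,
                remove_object: Callable) -> List[Probe]:
    """Build probes with the three strategies named in the abstract."""
    probes = []

    # (1) Abnormal-object insertion: add an object that conflicts with the
    #     scene's context cues, then ask whether it is present.
    img1 = insert_object(scene, abnormal_object)
    probes.append(Probe(img1, f"Is there a {abnormal_object} in the image?", "yes"))

    # (2) Paired-object manipulation: of two co-occurring objects, keep one
    #     and exclude the other, then ask about the excluded one.
    kept, excluded = co_occurring[0], co_occurring[1]
    img2 = remove_object(insert_object(scene, kept), excluded)
    probes.append(Probe(img2, f"Is there a {excluded} in the image?", "no"))

    # (3) Correlated-object removal: delete an object strongly tied to the
    #     context cues, then ask about it.
    img3 = remove_object(scene, co_occurring[0])
    probes.append(Probe(img3, f"Is there a {co_occurring[0]} in the image?", "no"))

    return probes


def hallucination_induced(model, probe: Probe) -> bool:
    """A probe succeeds if the model's answer contradicts the edit,
    i.e. it follows its language prior instead of the image evidence."""
    answer = model.ask(probe.image, probe.question).strip().lower()
    return not answer.startswith(probe.ground_truth)
```

In this sketch, a ground-truth answer is fixed by the edit itself (an inserted object must be reported, a removed one must not), so any contradictory or inconsistent answer can be counted as an induced hallucination without human annotation.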