

Fixing Imbalanced Attention to Mitigate In-Context Hallucination of Large Vision-Language Model

January 21, 2025
Authors: Kazi Hasan Ibn Arif, Sajib Acharjee Dip, Khizar Hussain, Lang Zhang, Chris Thomas
cs.AI

Abstract

Large Vision Language Models (LVLMs) have demonstrated remarkable capabilities in understanding and describing visual content, achieving state-of-the-art performance across various vision-language tasks. However, these models frequently exhibit hallucination behavior, where they generate descriptions containing objects or details absent in the input image. Our work investigates this phenomenon by analyzing attention patterns across transformer layers and heads, revealing that hallucinations often stem from progressive degradation of visual grounding in deeper layers. We propose a novel attention modification approach that combines selective token emphasis and head-specific modulation to maintain visual grounding throughout the generation process. Our method introduces two key components: (1) a dual-stream token selection mechanism that identifies and prioritizes both locally informative and spatially significant visual tokens, and (2) an attention head-specific modulation strategy that differentially amplifies visual information processing based on measured visual sensitivity of individual attention heads. Through extensive experimentation on the MSCOCO dataset, we demonstrate that our approach reduces hallucination rates by up to 62.3% compared to baseline models while maintaining comparable task performance. Our analysis reveals that selectively modulating tokens across attention heads with varying levels of visual sensitivity can significantly improve visual grounding without requiring model retraining.
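The abstract only names the two components at a high level; the sketch below is an illustrative, hypothetical reading of how they might look in code, not the authors' implementation. Function names (`select_visual_tokens`, `modulate_attention`), the `head_sensitivity` vector, and the `alpha` strength are assumptions introduced for illustration.

```python
import torch

def select_visual_tokens(local_scores, spatial_scores, k_local=8, k_spatial=8):
    """Dual-stream selection (sketch): union of top-k visual tokens from a
    'locally informative' stream and a 'spatially significant' stream."""
    local_top = torch.topk(local_scores, k_local).indices
    spatial_top = torch.topk(spatial_scores, k_spatial).indices
    return torch.unique(torch.cat([local_top, spatial_top]))

def modulate_attention(attn, visual_idx, head_sensitivity, alpha=0.5):
    """Head-specific modulation (sketch).
    attn: (num_heads, q_len, k_len) post-softmax attention weights
    visual_idx: 1-D tensor of key positions holding the selected visual tokens
    head_sensitivity: (num_heads,) assumed pre-measured visual sensitivity in [0, 1]
    alpha: global amplification strength (hypothetical hyperparameter)
    Returns attention with visual tokens up-weighted per head, renormalized."""
    scale = 1.0 + alpha * head_sensitivity.view(-1, 1, 1)  # stronger boost for visually sensitive heads
    boosted = attn.clone()
    boosted[:, :, visual_idx] = boosted[:, :, visual_idx] * scale
    return boosted / boosted.sum(dim=-1, keepdim=True)     # keep each row a valid distribution

# Toy usage with hypothetical shapes: 4 heads, 1 decoding step, 16 keys,
# of which the first 10 positions are visual tokens.
attn = torch.softmax(torch.randn(4, 1, 16), dim=-1)
local, spatial = torch.rand(10), torch.rand(10)
vis = select_visual_tokens(local, spatial, k_local=3, k_spatial=3)
sens = torch.rand(4)
new_attn = modulate_attention(attn, vis, sens, alpha=0.5)
```

Applied at inference time only, a modification of this shape leaves the model weights untouched, which is consistent with the abstract's claim that visual grounding improves without retraining.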
