
Mitigating Object Hallucination via Concentric Causal Attention

October 21, 2024
Authors: Yun Xing, Yiheng Li, Ivan Laptev, Shijian Lu
cs.AI

Abstract

Recent Large Vision Language Models (LVLMs) present remarkable zero-shot conversational and reasoning capabilities given multimodal queries. Nevertheless, they suffer from object hallucination, a phenomenon where LVLMs are prone to generate textual responses not factually aligned with image inputs. Our pilot study reveals that object hallucination is closely tied to Rotary Position Encoding (RoPE), a widely adopted positional dependency modeling design in existing LVLMs. Due to the long-term decay in RoPE, LVLMs tend to hallucinate more when relevant visual cues are distant from instruction tokens in the multimodal input sequence. Additionally, we observe a similar effect when reversing the sequential order of visual tokens during multimodal alignment. Our tests indicate that long-term decay in RoPE poses challenges to LVLMs in capturing visual-instruction interactions across long distances. We propose Concentric Causal Attention (CCA), a simple yet effective positional alignment strategy that mitigates the impact of RoPE long-term decay in LVLMs by naturally reducing the relative distance between visual and instruction tokens. With CCA, visual tokens can better interact with instruction tokens, thereby enhancing the model's perception capability and alleviating object hallucination. Without bells and whistles, our positional alignment method surpasses existing hallucination mitigation strategies by large margins on multiple object hallucination benchmarks.
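The abstract describes CCA only at a high level. The sketch below illustrates one plausible reading of a concentric position assignment for visual tokens: position indices are assigned ring by ring rather than in raster-scan order, which shrinks the maximum positional gap between any visual token and the instruction tokens that follow. The square n x n grid, the outermost-ring-first ordering, the shared per-ring position indices, and the function names are all illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch (not the authors' code): assign position indices to an
# n x n grid of visual tokens by concentric rings, outermost ring first, so
# the largest visual-to-instruction relative distance shrinks from roughly
# n*n (raster order) to roughly n/2 (number of rings).

def ring_index(i: int, j: int, n: int) -> int:
    """Ring number of cell (i, j): 0 for the outermost ring, growing inward."""
    return min(i, j, n - 1 - i, n - 1 - j)

def concentric_positions(n: int) -> list[list[int]]:
    """Position index per visual token; tokens on the same ring share an index
    (an assumption made here for illustration)."""
    return [[ring_index(i, j, n) for j in range(n)] for i in range(n)]

if __name__ == "__main__":
    n = 6  # e.g. a 6 x 6 grid of visual tokens
    for row in concentric_positions(n):
        print(" ".join(f"{p:2d}" for p in row))

    # Raster-scan RoPE places the first visual token n*n - 1 positions away
    # from the last one, so under long-term decay its interaction with the
    # trailing instruction tokens is heavily attenuated. With per-ring
    # indices, the maximal gap is only the ring count minus one.
    max_gap_raster = n * n - 1
    max_gap_concentric = (n + 1) // 2 - 1
    print(f"max relative distance: raster={max_gap_raster}, "
          f"concentric={max_gap_concentric}")
```

A complete method would also pair this assignment with a causal attention mask consistent with the ring ordering (e.g. inner rings attending to outer ones); that part is omitted from this sketch.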
