

On Epistemic Uncertainty of Visual Tokens for Object Hallucinations in Large Vision-Language Models

October 10, 2025
Authors: Hoigi Seo, Dong Un Kang, Hyunjin Cho, Joohoon Lee, Se Young Chun
cs.AI

Abstract

Large vision-language models (LVLMs), which integrate a vision encoder (VE) with a large language model, have achieved remarkable success across various tasks. However, LVLMs still face crucial challenges such as object hallucination, i.e., generating descriptions of objects that are not present in the input image. Here, we argue that uncertain visual tokens within the VE are a key factor contributing to object hallucination. Our statistical analysis found positive correlations between visual tokens with high epistemic uncertainty and the occurrence of hallucinations. Furthermore, we show theoretically and empirically that visual tokens in early VE layers that exhibit large representation deviations under small adversarial perturbations indicate high epistemic uncertainty. Based on these findings, we propose a simple yet effective strategy to mitigate object hallucination by modifying the VE only. Our method comprises a proxy method that uses adversarial perturbations to identify uncertain visual tokens efficiently, and a method to mask these uncertain visual tokens during the self-attention process in the middle layers of the VE, suppressing their influence on visual encoding and thus alleviating hallucinations. Extensive experiments show that our method significantly reduces object hallucinations in LVLMs and works synergistically with prior methods.
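
The abstract does not give implementation details, but the described pipeline (flag visual tokens whose early-layer representations shift most under a small adversarial perturbation, then mask those tokens as attention keys in middle VE layers) can be sketched roughly as below. This is a minimal, hypothetical Python/PyTorch sketch, not the authors' code: the names vision_encoder, early_layer_features, epsilon, and mask_ratio are assumptions, and the FGSM-style one-step perturbation is one plausible choice of "small adversarial perturbation".

```python
import torch

def uncertain_token_mask(vision_encoder, image, epsilon=1e-3, mask_ratio=0.1):
    """Flag visual tokens whose early-layer features deviate most under a
    small adversarial perturbation (used here as an epistemic-uncertainty proxy).
    Assumes early_layer_features(image) returns a (num_tokens, dim) tensor."""
    image = image.clone().detach().requires_grad_(True)
    feats = vision_encoder.early_layer_features(image)      # hypothetical hook
    # One-step (FGSM-style) perturbation in the direction that grows the feature norm.
    feats.norm().backward()
    perturbed = image + epsilon * image.grad.sign()
    with torch.no_grad():
        feats_adv = vision_encoder.early_layer_features(perturbed)
        # Per-token representation deviation as the uncertainty score.
        deviation = (feats_adv - feats.detach()).norm(dim=-1)   # (num_tokens,)
    k = max(1, int(mask_ratio * deviation.numel()))
    mask = torch.zeros_like(deviation, dtype=torch.bool)
    mask[deviation.topk(k).indices] = True                   # True = token to suppress
    return mask

def masked_self_attention(q, k, v, token_mask):
    """Single-head attention where flagged tokens are dropped as *keys*, so the
    remaining tokens no longer attend to them (one reading of the masking step,
    applied in the VE's middle layers). q, k, v: (num_tokens, dim)."""
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    scores = scores.masked_fill(token_mask[None, :], float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```

In this reading, the uncertainty mask is computed once per image from the early layers and then injected as an attention bias in the middle layers only, leaving the LVLM's language model untouched; the actual layer range and masking ratio would be hyperparameters reported in the paper.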