On Epistemic Uncertainty of Visual Tokens for Object Hallucinations in Large Vision-Language Models

Abstract

Large vision-language models (LVLMs), which integrate a vision encoder (VE) with a large language model, have achieved remarkable success across various tasks. However, LVLMs still face crucial challenges such as object hallucination: generating descriptions of objects that are not in the input image. Here, we argue that uncertain visual tokens within the VE are a key factor contributing to object hallucination. Our statistical analysis finds positive correlations between visual tokens with high epistemic uncertainty and the occurrence of hallucinations. Furthermore, we show theoretically and empirically that visual tokens in early VE layers that exhibit large representation deviations under small adversarial perturbations indicate high epistemic uncertainty. Based on these findings, we propose a simple yet effective strategy to mitigate object hallucination by modifying only the VE. Our method comprises a proxy method that uses adversarial perturbations to identify uncertain visual tokens efficiently, and a method that masks these uncertain visual tokens during self-attention in the middle layers of the VE, suppressing their influence on visual encoding and thus alleviating hallucinations. Extensive experiments show that our method significantly reduces object hallucinations in LVLMs and can work synergistically with prior methods.
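
The two components named in the abstract (scoring tokens by how far their representations move under a small perturbation, then masking the highest-scoring tokens in self-attention) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names `token_deviation`, `uncertain_mask`, and `masked_attention` are hypothetical, the perturbed features are passed in directly rather than produced by an adversarial attack, and selecting a fixed top-k of tokens is an assumption; the abstract does not specify any of these details.

```python
import math


def token_deviation(clean, perturbed):
    """Per-token L2 distance between clean and perturbed features.

    Assumption: a larger distance is used as a proxy for higher uncertainty.
    """
    return [math.dist(c, p) for c, p in zip(clean, perturbed)]


def uncertain_mask(deviations, k):
    """Return True for the k tokens with the largest deviation (assumed rule)."""
    order = sorted(range(len(deviations)), key=lambda i: deviations[i], reverse=True)
    chosen = set(order[:k])
    return [i in chosen for i in range(len(deviations))]


def masked_attention(queries, keys, values, masked):
    """Scaled dot-product attention that ignores masked key tokens.

    Requires at least one unmasked key, otherwise the softmax is undefined.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [
            -math.inf if masked[j] else scale * sum(a * b for a, b in zip(q, k))
            for j, k in enumerate(keys)
        ]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]  # masked keys get weight 0.0
        total = sum(weights)
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))
        ])
    return outputs


# Toy features for three tokens; token 1 moves the most under perturbation.
clean = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
perturbed = [[1.0, 0.1], [0.9, 0.2], [1.0, 1.0]]

deviations = token_deviation(clean, perturbed)
mask = uncertain_mask(deviations, k=1)
attended = masked_attention(clean, clean, clean, mask)
```

In this toy example only token 1 is masked, so every output row is a weighted average of tokens 0 and 2 alone. In a real vision encoder the same idea would apply per attention head inside the chosen layers, which this sketch does not model.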

