On Epistemic Uncertainty of Visual Tokens for Object Hallucinations in Large Vision-Language Models

Abstract

Large vision-language models (LVLMs), which integrate a vision encoder (VE) with a large language model, have achieved remarkable success across various tasks. However, crucial challenges remain, such as object hallucination: generating descriptions of objects that are not present in the input image. Here, we argue that uncertain visual tokens within the VE are a key factor contributing to object hallucination. Our statistical analysis finds positive correlations between visual tokens with high epistemic uncertainty and the occurrence of hallucinations. Furthermore, we show theoretically and empirically that visual tokens in early VE layers that exhibit large representation deviations under small adversarial perturbations indicate high epistemic uncertainty. Based on these findings, we propose a simple yet effective strategy to mitigate object hallucination by modifying only the VE. Our method comprises two parts: a proxy based on adversarial perturbations that efficiently identifies uncertain visual tokens, and a masking step that excludes these tokens during self-attention in the middle layers of the VE, suppressing their influence on visual encoding and thus alleviating hallucinations. Extensive experiments show that our method significantly reduces object hallucinations in LVLMs and can be combined with prior techniques.
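
To make the two-step idea concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that clean and adversarially perturbed token features are already available as lists of vectors (the perturbation itself is not computed here), uses the L2 distance as the deviation measure, and picks the top-k tokens as "uncertain". The function names, the top-k rule, and the toy values are all hypothetical choices made for illustration.

```python
import math

def token_deviation(clean, perturbed):
    """L2 distance between each token's clean and perturbed feature vector."""
    return [math.dist(c, p) for c, p in zip(clean, perturbed)]

def select_uncertain(deviations, k):
    """Indices of the k tokens with the largest deviation (hypothetical top-k rule)."""
    order = sorted(range(len(deviations)), key=lambda i: deviations[i], reverse=True)
    return set(order[:k])

def masked_self_attention(queries, keys, values, masked):
    """Scaled dot-product attention in which masked tokens are never used as keys."""
    if len(masked) >= len(keys):
        raise ValueError("at least one token must remain unmasked")
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        # Masked keys get a logit of -inf, so their softmax weight is exactly 0.
        logits = [
            -math.inf if j in masked else scale * sum(a * b for a, b in zip(q, k))
            for j, k in enumerate(keys)
        ]
        peak = max(logits)
        weights = [math.exp(l - peak) for l in logits]
        total = sum(weights)
        weights = [w / total for w in weights]
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs

# Toy features for three tokens (assumed values, not real encoder outputs).
clean = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
perturbed = [[1.0, 0.1], [0.9, 0.2], [1.0, 1.05]]

# Token 1 deviates the most under the perturbation, so it is selected.
uncertain = select_uncertain(token_deviation(clean, perturbed), k=1)

# Token 1 no longer contributes to any output token.
out = masked_self_attention(clean, clean, clean, uncertain)
```

In this sketch the mask is applied on the key side only, so an uncertain token still receives an output of its own but cannot influence the other tokens. Whether the method described above masks in exactly this way is not stated in the abstract.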
