Towards Mitigating Hallucinations in Large Vision-Language Models by Refining Textual Embeddings
November 7, 2025
Authors: Aakriti Agrawal, Gouthaman KV, Rohith Aralikatti, Gauri Jagatap, Jiaxin Yuan, Vijay Kamarshi, Andrea Fanelli, Furong Huang
cs.AI
Abstract
In this work, we identify an inherent bias in prevailing LVLM architectures
toward the language modality, largely resulting from the common practice of
simply appending visual embeddings to the input text sequence. To address this,
we propose a simple yet effective method that refines textual embeddings by
integrating average-pooled visual features. Our approach demonstrably improves
visual grounding and significantly reduces hallucinations on established
benchmarks. While average pooling offers a straightforward, robust, and
efficient means of incorporating visual information, we believe that more
sophisticated fusion methods could further enhance visual grounding and
cross-modal alignment. Given that the primary focus of this work is to
highlight the modality imbalance and its impact on hallucinations -- and to
show that refining textual embeddings with visual information mitigates this
issue -- we leave exploration of advanced fusion strategies for future work.
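
To make the described idea concrete, below is a minimal sketch (not the authors' released code) of refining textual embeddings with an average-pooled visual summary before they enter the LVLM's language backbone. The function name `refine_text_embeddings` and the fusion weight `alpha` are illustrative assumptions; the paper may fix, learn, or parameterize this fusion differently.

```python
# Minimal sketch, assuming a simple additive fusion of a mean-pooled visual
# vector into every text-token embedding. Names and the alpha weight are
# hypothetical, not taken from the paper.

import torch


def refine_text_embeddings(
    text_emb: torch.Tensor,    # (batch, num_text_tokens, dim)
    visual_emb: torch.Tensor,  # (batch, num_visual_tokens, dim)
    alpha: float = 0.5,        # assumed fusion weight
) -> torch.Tensor:
    """Blend an average-pooled visual summary into each text-token embedding."""
    # One global visual vector per image: mean over the visual-token axis.
    pooled = visual_emb.mean(dim=1, keepdim=True)  # (batch, 1, dim)
    # Broadcast-add the pooled visual summary to every textual embedding.
    return text_emb + alpha * pooled


if __name__ == "__main__":
    B, T, V, D = 2, 8, 16, 32
    text = torch.randn(B, T, D)
    vision = torch.randn(B, V, D)
    refined = refine_text_embeddings(text, vision)
    # The refined text embeddings (together with the usual appended visual
    # tokens) would then be fed to the language model as in a standard LVLM.
    print(refined.shape)  # torch.Size([2, 8, 32])
```

In this sketch the visual tokens are still assumed to be appended to the input sequence as usual; the pooled-feature addition only counteracts the text-side bias the abstract describes, rather than replacing the standard visual-token pathway.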