Towards Mitigating Hallucinations in Large Vision-Language Models by Refining Textual Embeddings
November 7, 2025
Authors: Aakriti Agrawal, Gouthaman KV, Rohith Aralikatti, Gauri Jagatap, Jiaxin Yuan, Vijay Kamarshi, Andrea Fanelli, Furong Huang
cs.AI
Abstract
In this work, we identify an inherent bias in prevailing LVLM architectures
toward the language modality, largely resulting from the common practice of
simply appending visual embeddings to the input text sequence. To address this,
we propose a simple yet effective method that refines textual embeddings by
integrating average-pooled visual features. Our approach demonstrably improves
visual grounding and significantly reduces hallucinations on established
benchmarks. While average pooling offers a straightforward, robust, and
efficient means of incorporating visual information, we believe that more
sophisticated fusion methods could further enhance visual grounding and
cross-modal alignment. Given that the primary focus of this work is to
highlight the modality imbalance and its impact on hallucinations -- and to
show that refining textual embeddings with visual information mitigates this
issue -- we leave exploration of advanced fusion strategies for future work.
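
The abstract describes refining each textual token embedding with an average-pooled summary of the visual tokens, in contrast to the standard practice of only appending visual embeddings to the input sequence. The snippet below is a minimal PyTorch sketch of that idea, not the authors' implementation: the additive fusion, the scaling factor `alpha`, and the function names are illustrative assumptions, since the abstract only states that average-pooled visual features are integrated into the textual embeddings.

```python
import torch


def refine_text_embeddings(text_embeds: torch.Tensor,
                           visual_embeds: torch.Tensor,
                           alpha: float = 1.0) -> torch.Tensor:
    """Sketch: fuse a global visual summary into every text token embedding.

    text_embeds:   (batch, num_text_tokens, dim)
    visual_embeds: (batch, num_visual_tokens, dim), assumed already projected
                   into the LLM embedding space by the vision-to-language adapter.
    """
    # Average-pool the visual tokens into one global feature per image.
    pooled_visual = visual_embeds.mean(dim=1, keepdim=True)  # (batch, 1, dim)

    # Refine each text token with the pooled visual feature. Additive fusion
    # and the weight `alpha` are assumptions made for this illustration.
    return text_embeds + alpha * pooled_visual


# Usual LVLM input construction: visual tokens concatenated with the
# (now visually refined) text tokens before being fed to the language model.
batch, n_vis, n_txt, dim = 2, 256, 32, 4096
visual_embeds = torch.randn(batch, n_vis, dim)
text_embeds = torch.randn(batch, n_txt, dim)
inputs_embeds = torch.cat(
    [visual_embeds, refine_text_embeddings(text_embeds, visual_embeds)],
    dim=1,
)
```

Under these assumptions, the backbone architecture is left unchanged; only the text-side embeddings acquire a visual bias before concatenation, which is the mechanism the paper credits with counteracting the language-modality preference and reducing hallucinations.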