The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio
October 16, 2024
Authors: Sicong Leng, Yun Xing, Zesen Cheng, Yang Zhou, Hang Zhang, Xin Li, Deli Zhao, Shijian Lu, Chunyan Miao, Lidong Bing
cs.AI
Abstract
Recent advancements in large multimodal models (LMMs) have significantly
enhanced performance across diverse tasks, with ongoing efforts to further
integrate additional modalities such as video and audio. However, most existing
LMMs remain vulnerable to hallucinations, i.e., discrepancies between the factual
multimodal input and the generated textual output, which limit their
applicability in various real-world scenarios. This paper presents the first
systematic investigation of hallucinations in LMMs involving the three most
common modalities: language, visual, and audio. Our study reveals two key
contributors to hallucinations: overreliance on unimodal priors and spurious
inter-modality correlations. To address these challenges, we introduce the
benchmark The Curse of Multi-Modalities (CMM), which comprehensively evaluates
hallucinations in LMMs, providing a detailed analysis of their underlying
issues. Our findings highlight key vulnerabilities, including imbalances in
modality integration and biases from training data, underscoring the need for
balanced cross-modal learning and enhanced hallucination mitigation strategies.
Based on our observations and findings, we suggest potential research
directions that could enhance the reliability of LMMs.
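The abstract does not spell out the evaluation protocol, so the sketch below only illustrates one plausible probing-style setup for measuring hallucinations: the model is asked binary existence questions about content that is truly present versus content that is absent (hallucination bait) in a visual or audio input. The `query_lmm` callable, the probe format, and the metric names are assumptions for illustration, not the CMM benchmark's actual implementation.

```python
# Minimal sketch of a probing-style hallucination check (illustrative only).
# `query_lmm` is a hypothetical stand-in for any LMM inference call that takes
# a media path (e.g., a video or audio clip) and a yes/no question string.

from typing import Callable, Dict, List


def evaluate_probes(
    query_lmm: Callable[[str, str], str],
    probes: List[Dict[str, str]],
) -> Dict[str, float]:
    """Score an LMM on binary existence probes.

    Each probe holds a media path, a yes/no question about an object, event,
    or sound, and a ground-truth label: "yes" if the queried content is truly
    present, "no" if it is absent and answering "yes" would be a hallucination.
    """
    correct_present = total_present = 0  # probes about truly present content
    correct_absent = total_absent = 0    # probes about absent (bait) content

    for probe in probes:
        answer = query_lmm(probe["media"], probe["question"]).strip().lower()
        predicted_yes = answer.startswith("yes")
        if probe["label"] == "yes":
            total_present += 1
            correct_present += predicted_yes
        else:
            total_absent += 1
            correct_absent += not predicted_yes

    return {
        # How often the model confirms content that is really there.
        "perception_accuracy": correct_present / max(total_present, 1),
        # How often the model resists asserting content that is not there.
        "hallucination_resistance": correct_absent / max(total_absent, 1),
    }
```

Under this kind of protocol, a model that over-relies on unimodal priors or spurious inter-modality correlations would tend to answer "yes" to bait probes, which lowers the resistance score while leaving accuracy on present-content probes largely intact.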