Robust Multimodal Large Language Models Against Modality Conflict
July 9, 2025
Authors: Zongmeng Zhang, Wengang Zhou, Jie Zhao, Houqiang Li
cs.AI
Abstract
Despite the impressive capabilities of multimodal large language models
(MLLMs) in vision-language tasks, they are prone to hallucinations in
real-world scenarios. This paper investigates the hallucination phenomenon in
MLLMs from the perspective of modality conflict. Unlike existing works focusing
on the conflicts between model responses and inputs, we study the inherent
conflicts in inputs from different modalities that place MLLMs in a dilemma and
directly lead to hallucinations. We formally define modality conflict and
construct a dataset named Multimodal Modality Conflict (MMMC) to simulate this
phenomenon in vision-language tasks. Three methods based on prompt engineering,
supervised fine-tuning, and reinforcement learning are proposed to alleviate
the hallucination caused by modality conflict. Extensive experiments are
conducted on the MMMC dataset to analyze the merits and demerits of these
methods. Our results show that the reinforcement learning method achieves the
best performance in mitigating the hallucination under modality conflict, while
the supervised fine-tuning method shows promising and stable performance. Our
work sheds light on the unnoticed modality conflict that leads to
hallucinations and provides more insights into the robustness of MLLMs.
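To make the setting concrete, the sketch below shows one way a modality-conflict sample and the prompt-engineering mitigation mentioned above might be expressed in code. It is a minimal illustration only: the field names, prompt wording, and chat-message format are our own assumptions, not the paper's actual MMMC schema or prompts.

```python
# Illustrative sketch only. The sample fields, prompt wording, and message
# format are assumptions for exposition, not the MMMC dataset schema or the
# exact prompts used in the paper.
from dataclasses import dataclass


@dataclass
class ConflictSample:
    """A vision-language input whose textual premise contradicts the image."""
    image_path: str       # e.g. a photo that actually shows a cat on a sofa
    question: str         # question whose premise conflicts with the image
    faithful_answer: str  # answer grounded in the visual evidence


# The image shows a cat, but the question presupposes a dog: the model must
# either follow the text (and hallucinate) or follow the image.
sample = ConflictSample(
    image_path="images/cat_on_sofa.jpg",
    question="What breed is the dog on the sofa?",
    faithful_answer="There is no dog in the image; it shows a cat on the sofa.",
)

# Prompt-engineering mitigation: instruct the model to flag premise-image
# conflicts and to trust the image over the text.
CONFLICT_AWARE_SYSTEM_PROMPT = (
    "You are a visual assistant. If the question contains a premise that "
    "contradicts what is visible in the image, point out the conflict and "
    "answer based on the image rather than the text."
)


def build_messages(s: ConflictSample) -> list[dict]:
    """Assemble a conflict-aware request in a generic chat-message format."""
    return [
        {"role": "system", "content": CONFLICT_AWARE_SYSTEM_PROMPT},
        {
            "role": "user",
            "content": [
                {"type": "image", "path": s.image_path},
                {"type": "text", "text": s.question},
            ],
        },
    ]
```

Under the same assumptions, the supervised fine-tuning and reinforcement learning mitigations would instead train on such samples, e.g. supervising on or rewarding responses that agree with the visual evidence rather than the conflicting textual premise.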