Robust Multimodal Large Language Models Against Modality Conflict
July 9, 2025
Authors: Zongmeng Zhang, Wengang Zhou, Jie Zhao, Houqiang Li
cs.AI
Abstract
Despite the impressive capabilities of multimodal large language models
(MLLMs) in vision-language tasks, they are prone to hallucinations in
real-world scenarios. This paper investigates the hallucination phenomenon in
MLLMs from the perspective of modality conflict. Unlike existing works focusing
on the conflicts between model responses and inputs, we study the inherent
conflicts in inputs from different modalities that place MLLMs in a dilemma and
directly lead to hallucinations. We formally define modality conflict and
construct a dataset named Multimodal Modality Conflict (MMMC) to simulate this
phenomenon in vision-language tasks. Three methods based on prompt engineering,
supervised fine-tuning, and reinforcement learning are proposed to alleviate
the hallucination caused by modality conflict. Extensive experiments are
conducted on the MMMC dataset to analyze the merits and demerits of these
methods. Our results show that the reinforcement learning method achieves the
best performance in mitigating the hallucination under modality conflict, while
the supervised fine-tuning method shows promising and stable performance. Our
work sheds light on the unnoticed modality conflict that leads to
hallucinations and provides more insights into the robustness of MLLMs.
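To make the setting concrete, the sketch below shows one way a modality-conflict sample and the prompt-engineering mitigation mentioned above might be expressed in code. It is a minimal illustration only: the field names, prompt wording, and chat-message format are our own assumptions, not the paper's actual MMMC schema or prompts.

```python
# Illustrative sketch only. The sample fields, prompt wording, and message
# format are assumptions for exposition, not the MMMC dataset schema or the
# exact prompts used in the paper.
from dataclasses import dataclass


@dataclass
class ConflictSample:
    """A vision-language input whose textual premise contradicts the image."""
    image_path: str       # e.g. a photo that actually shows a cat on a sofa
    question: str         # question whose premise conflicts with the image
    faithful_answer: str  # answer grounded in the visual evidence


# The image shows a cat, but the question presupposes a dog: the model must
# either follow the text (and hallucinate) or follow the image.
sample = ConflictSample(
    image_path="images/cat_on_sofa.jpg",
    question="What breed is the dog on the sofa?",
    faithful_answer="There is no dog in the image; it shows a cat on the sofa.",
)

# Prompt-engineering mitigation: instruct the model to flag premise-image
# conflicts and to trust the image over the text.
CONFLICT_AWARE_SYSTEM_PROMPT = (
    "You are a visual assistant. If the question contains a premise that "
    "contradicts what is visible in the image, point out the conflict and "
    "answer based on the image rather than the text."
)


def build_messages(s: ConflictSample) -> list[dict]:
    """Assemble a conflict-aware request in a generic chat-message format."""
    return [
        {"role": "system", "content": CONFLICT_AWARE_SYSTEM_PROMPT},
        {
            "role": "user",
            "content": [
                {"type": "image", "path": s.image_path},
                {"type": "text", "text": s.question},
            ],
        },
    ]
```

Under the same assumptions, the supervised fine-tuning and reinforcement learning mitigations would instead train on such samples, e.g. supervising on or rewarding responses that agree with the visual evidence rather than the conflicting textual premise.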