Robust Multimodal Large Language Models Against Modality Conflict
July 9, 2025
Authors: Zongmeng Zhang, Wengang Zhou, Jie Zhao, Houqiang Li
cs.AI
Abstract
Despite the impressive capabilities of multimodal large language models
(MLLMs) in vision-language tasks, they are prone to hallucinations in
real-world scenarios. This paper investigates the hallucination phenomenon in
MLLMs from the perspective of modality conflict. Unlike existing works focusing
on the conflicts between model responses and inputs, we study the inherent
conflicts in inputs from different modalities that place MLLMs in a dilemma and
directly lead to hallucinations. We formally define the modality conflict and
construct a dataset named Multimodal Modality Conflict (MMMC) to simulate this
phenomenon in vision-language tasks. Three methods based on prompt engineering,
supervised fine-tuning, and reinforcement learning are proposed to alleviate
the hallucination caused by modality conflict. Extensive experiments are
conducted on the MMMC dataset to analyze the merits and demerits of these
methods. Our results show that the reinforcement learning method achieves the
best performance in mitigating the hallucination under modality conflict, while
the supervised fine-tuning method shows promising and stable performance. Our
work sheds light on the unnoticed modality conflict that leads to
hallucinations and provides more insights into the robustness of MLLMs.
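To make the notion of modality conflict concrete, the sketch below builds a toy vision-language input whose textual question presupposes something the (hypothetical) image does not show, and wraps it with a conflict-aware instruction in the spirit of the prompt-engineering mitigation named in the abstract. The prompt wording, the example, and the helper `build_prompt` are our own illustration under stated assumptions, not the prompts, data, or methods used in the MMMC paper.

```python
# Minimal, hypothetical illustration of modality conflict and a
# prompt-engineering-style mitigation. Nothing here reproduces the
# paper's actual prompts or the MMMC dataset construction.

CONFLICT_AWARE_INSTRUCTION = (
    "You are given an image and a question. If the question contains a "
    "premise that contradicts what the image shows, point out the conflict "
    "instead of answering as if the premise were true."
)


def build_prompt(question: str) -> str:
    """Wrap a user question with a conflict-aware instruction (assumed format)."""
    return f"{CONFLICT_AWARE_INSTRUCTION}\n\nQuestion: {question}"


if __name__ == "__main__":
    # Toy conflicting input: suppose the image shows a dog sleeping on a
    # sofa, while the question presupposes a cat. This inter-modality
    # conflict is the kind of input that can push an MLLM toward a
    # hallucinated answer (e.g., inventing a cat's color).
    question = "What color is the cat lying on the sofa?"
    print(build_prompt(question))
```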