Robust Multimodal Large Language Models Against Modality Conflict

Abstract

Although multimodal large language models (MLLMs) show impressive capabilities in vision-language tasks, they are prone to hallucinations in real-world scenarios. This paper investigates hallucination in MLLMs from the perspective of modality conflict. Whereas existing work focuses on conflicts between model responses and inputs, we study the conflicts inherent in inputs from different modalities, which place MLLMs in a dilemma and lead directly to hallucinations. We formally define modality conflict and construct a dataset, Multimodal Modality Conflict (MMMC), to simulate this phenomenon in vision-language tasks. We propose three methods, based on prompt engineering, supervised fine-tuning, and reinforcement learning, to alleviate the hallucinations caused by modality conflict. Extensive experiments on the MMMC dataset analyze the strengths and weaknesses of these methods. Our results show that the reinforcement learning method performs best at mitigating hallucination under modality conflict, while the supervised fine-tuning method shows promising and stable performance. Our work sheds light on the previously unnoticed modality conflict that leads to hallucinations and provides further insight into the robustness of MLLMs.

