MAD: Modality-Adaptive Decoding for Mitigating Cross-Modal Hallucinations in Multimodal Large Language Models
January 29, 2026
Authors: Sangyun Chung, Se Yeon Kim, Youngchae Chee, Yong Man Ro
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) suffer from cross-modal hallucinations, where one modality inappropriately influences generation about another, leading to fabricated outputs. This exposes a more fundamental deficiency in modality-interaction control. To address this, we propose Modality-Adaptive Decoding (MAD), a training-free method that adaptively weights modality-specific decoding branches based on task requirements. MAD leverages the model's inherent ability to self-assess modality relevance by querying which modalities are needed for each task. The extracted modality probabilities are then used to adaptively weight contrastive decoding branches, enabling the model to focus on relevant information while suppressing cross-modal interference. Extensive experiments on CMM and AVHBench demonstrate that MAD significantly reduces cross-modal hallucinations across multiple audio-visual language models (7.8% and 2.0% improvements for VideoLLaMA2-AV, 8.7% and 4.7% for Qwen2.5-Omni). Our results show that explicit modality awareness through self-assessment is crucial for robust multimodal reasoning, offering a principled extension to existing contrastive decoding methods. Our code is available at https://github.com/top-yun/MAD.
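The abstract describes two steps: a self-assessment query that yields per-modality relevance probabilities, and a contrastive decoding step whose branches are weighted by those probabilities. The PyTorch sketch below is a minimal illustration of the second step under assumed interfaces; the branch names (`logits_full`, `logits_audio_only`, `logits_video_only`), the normalization, the weighting scheme, and the `alpha` hyperparameter are illustrative assumptions, not the paper's exact formulation (see the released code at https://github.com/top-yun/MAD for the authors' implementation).

```python
import torch


def modality_adaptive_logits(logits_full: torch.Tensor,
                             logits_audio_only: torch.Tensor,
                             logits_video_only: torch.Tensor,
                             p_audio: float, p_video: float,
                             alpha: float = 1.0) -> torch.Tensor:
    """Hypothetical sketch of adaptively weighted contrastive decoding.

    logits_full        -- next-token logits computed with both audio and video
    logits_audio_only  -- logits computed with the video input removed/masked
    logits_video_only  -- logits computed with the audio input removed/masked
    p_audio, p_video   -- modality-relevance probabilities obtained from the
                          self-assessment query (assumed to lie in [0, 1])
    alpha              -- contrastive strength (assumed hyperparameter)
    """
    # Weight each single-modality branch by the relevance of the modality it
    # is missing: if the task mainly needs video (high p_video), tokens that
    # the audio-only branch favors are treated as cross-modal interference.
    denom = max(p_audio + p_video, 1e-6)  # normalize the two branch weights
    contrast = (p_video * logits_audio_only + p_audio * logits_video_only) / denom
    return (1.0 + alpha) * logits_full - alpha * contrast


# Toy usage: greedy next-token choice over an assumed vocabulary size.
vocab_size = 32000
full = torch.randn(vocab_size)
audio_only = torch.randn(vocab_size)
video_only = torch.randn(vocab_size)
next_token = modality_adaptive_logits(full, audio_only, video_only,
                                      p_audio=0.2, p_video=0.8).argmax()
```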