MAD: Modality-Adaptive Decoding for Mitigating Cross-Modal Hallucinations in Multimodal Large Language Models
January 29, 2026
Authors: Sangyun Chung, Se Yeon Kim, Youngchae Chee, Yong Man Ro
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) suffer from cross-modal hallucinations, where one modality inappropriately influences generation about another, leading to fabricated outputs. This exposes a more fundamental deficiency in modality-interaction control. To address this, we propose Modality-Adaptive Decoding (MAD), a training-free method that adaptively weights modality-specific decoding branches based on task requirements. MAD leverages the model's inherent ability to self-assess modality relevance by querying which modalities are needed for each task. The extracted modality probabilities are then used to adaptively weight contrastive decoding branches, enabling the model to focus on relevant information while suppressing cross-modal interference. Extensive experiments on CMM and AVHBench demonstrate that MAD significantly reduces cross-modal hallucinations across multiple audio-visual language models (7.8% and 2.0% improvements for VideoLLaMA2-AV, 8.7% and 4.7% improvements for Qwen2.5-Omni). Our approach demonstrates that explicit modality awareness through self-assessment is crucial for robust multimodal reasoning, offering a principled extension to existing contrastive decoding methods. Our code is available at https://github.com/top-yun/MAD
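To make the adaptive weighting idea concrete, below is a minimal sketch of how self-assessed modality probabilities could scale contrastive decoding branches at the logit level. The function name, the use of audio-ablated and video-ablated branches, and the `alpha` hyperparameter are illustrative assumptions for exposition, not the authors' actual implementation; see the linked repository for the official code.

```python
# Sketch: modality-adaptive weighting of contrastive decoding branches.
# Assumes the surrounding inference loop can produce next-token logits
# under full, audio-ablated, and video-ablated conditions (hypothetical setup).
import torch


def modality_adaptive_logits(
    logits_full: torch.Tensor,      # logits with both audio and video present
    logits_no_audio: torch.Tensor,  # logits with the audio input ablated
    logits_no_video: torch.Tensor,  # logits with the video input ablated
    p_audio: float,                 # self-assessed relevance of audio for the task
    p_video: float,                 # self-assessed relevance of video for the task
    alpha: float = 1.0,             # contrastive strength (assumed hyperparameter)
) -> torch.Tensor:
    """Weight contrastive branches by the model's self-assessed modality relevance."""
    # Contrast against the audio-ablated branch: emphasizes audio-grounded tokens.
    audio_branch = logits_full - logits_no_audio
    # Contrast against the video-ablated branch: emphasizes video-grounded tokens.
    video_branch = logits_full - logits_no_video
    # Normalize the relevance scores into branch weights.
    total = max(p_audio + p_video, 1e-8)
    w_audio, w_video = p_audio / total, p_video / total
    # Mix the contrastive corrections into the base distribution adaptively.
    return logits_full + alpha * (w_audio * audio_branch + w_video * video_branch)


if __name__ == "__main__":
    # Toy usage over a small vocabulary, with randomly generated logits.
    vocab = 10
    full = torch.randn(vocab)
    no_audio = torch.randn(vocab)
    no_video = torch.randn(vocab)
    # Suppose the model judged the question to be mostly about the audio track.
    adjusted = modality_adaptive_logits(full, no_audio, no_video, p_audio=0.9, p_video=0.1)
    print(torch.softmax(adjusted, dim=-1))
```

In this reading, a question the model judges to be audio-centric pushes decoding toward tokens supported by the audio evidence and suppresses interference from the visual stream, and vice versa; the exact branch construction used by MAD may differ.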