MAD: Modality-Adaptive Decoding for Mitigating Cross-Modal Hallucinations in Multimodal Large Language Models

Abstract

Multimodal Large Language Models (MLLMs) suffer from cross-modal hallucinations, in which one modality inappropriately influences generation about another and leads to fabricated output. This exposes a more fundamental deficiency in how modality interaction is controlled. To address this, we propose Modality-Adaptive Decoding (MAD), a training-free method that adaptively weights modality-specific decoding branches based on task requirements. MAD leverages the model's inherent ability to self-assess modality relevance by querying which modalities are needed for each task. The extracted modality probabilities are then used to adaptively weight contrastive decoding branches, enabling the model to focus on relevant information while suppressing cross-modal interference. Extensive experiments on CMM and AVHBench demonstrate that MAD significantly reduces cross-modal hallucinations across multiple audio-visual language models (7.8% and 2.0% improvements for VideoLLaMA2-AV, 8.7% and 4.7% improvements for Qwen2.5-Omni). Our approach demonstrates that explicit modality awareness through self-assessment is crucial for robust multimodal reasoning, offering a principled extension to existing contrastive decoding methods. Our code is available at https://github.com/top-yun/MAD.
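
To make the weighting idea concrete, the following is a minimal sketch of one plausible way to combine contrastive branches using self-assessed modality probabilities. It is not taken from the MAD code base. The function names, the `alpha` scaling factor, the use of a softmax over relevance scores, and the additive form `full + w * (full - ablated)` are all assumptions chosen for illustration; the actual formulation in the paper may differ.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def modality_weights(relevance_scores):
    """Convert self-assessed relevance scores into per-modality weights.

    relevance_scores is a hypothetical dict such as {"audio": 0.3, "video": 2.1},
    standing in for whatever the model returns when asked which modalities
    the task needs.
    """
    names = list(relevance_scores)
    probs = softmax([relevance_scores[name] for name in names])
    return dict(zip(names, probs))


def adaptive_contrastive_logits(full, ablated, weights, alpha=1.0):
    """Weight each contrastive branch by the relevance of its modality.

    full:     next-token logits with all modalities present
    ablated:  dict mapping a modality name to logits computed with that
              modality removed or corrupted (assumed contrastive branch)
    weights:  dict mapping a modality name to its relevance weight
    alpha:    assumed global contrast strength, not a value from the paper
    """
    adjusted = list(full)
    for name, branch in ablated.items():
        scale = alpha * weights[name]
        for i, (f, a) in enumerate(zip(full, branch)):
            # Amplify evidence that disappears when this modality is removed.
            adjusted[i] += scale * (f - a)
    return adjusted


if __name__ == "__main__":
    weights = modality_weights({"audio": 0.0, "video": 0.0})
    full = [2.0, 1.0, 0.0]
    ablated = {"audio": [2.0, 1.0, 0.0], "video": [0.0, 1.0, 2.0]}
    print(adaptive_contrastive_logits(full, ablated, weights))
```

In this toy input, removing audio leaves the logits unchanged, so the audio branch contributes nothing, while the video branch shifts probability toward the token that depends on visual evidence. A question that the model judges to be purely visual would give the video branch a larger weight and the audio branch a smaller one.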
