Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for Multimodal Reasoning Models

Abstract

Existing Multimodal Large Language Models (MLLMs) are predominantly trained and tested on consistent visual-textual inputs, leaving open the question of whether they can handle inconsistencies in real-world, layout-rich content. To bridge this gap, we propose the Multimodal Inconsistency Reasoning (MMIR) benchmark to assess MLLMs' ability to detect and reason about semantic mismatches in artifacts such as webpages, presentation slides, and posters. MMIR comprises 534 challenging samples, each containing synthetically injected errors across five reasoning-heavy categories: Factual Contradiction, Identity Misattribution, Contextual Mismatch, Quantitative Discrepancy, and Temporal/Spatial Incoherence. We evaluate six state-of-the-art MLLMs and show that models with dedicated multimodal reasoning capabilities, such as o1, substantially outperform their counterparts, while open-source models remain particularly vulnerable to inconsistency errors. Detailed error analyses further show that models excel at detecting inconsistencies confined to a single modality, particularly text, but struggle with cross-modal conflicts and complex layouts. Probing experiments show that single-modality prompting, including Chain-of-Thought (CoT) and Set-of-Mark (SoM) methods, yields only marginal gains, exposing a key bottleneck in cross-modal reasoning. Our findings highlight the need for advanced multimodal reasoning and point to future research on multimodal inconsistency.
