MARBLE: A Hard Benchmark for Multimodal Spatial Reasoning and Planning

Abstract

The ability to process information from multiple modalities and to reason through it step by step remains a critical challenge in advancing artificial intelligence. However, existing reasoning benchmarks either focus on text-only reasoning or employ multimodal questions that can be answered by directly retrieving information from a non-text modality. Thus, complex reasoning remains poorly understood in multimodal domains. Here, we present MARBLE, a challenging multimodal reasoning benchmark designed to scrutinize multimodal language models (MLLMs) in their ability to carefully reason step by step through complex multimodal problems and environments. MARBLE is composed of two highly challenging tasks, M-Portal and M-Cube, that require the crafting and understanding of multistep plans under spatial, visual, and physical constraints. We find that current MLLMs perform poorly on MARBLE: all 12 advanced models obtain near-random performance on M-Portal and 0% accuracy on M-Cube. Only in simplified subtasks do some models outperform the random baseline, indicating that complex reasoning is still a challenge for existing MLLMs. Moreover, we show that perception remains a bottleneck: MLLMs occasionally fail to extract information from the visual inputs. By shedding light on the limitations of MLLMs, we hope that MARBLE will spur the development of the next generation of models with the ability to reason and plan across many multimodal reasoning steps.
