MME-CC: A Challenging Multi-Modal Evaluation Benchmark of Cognitive Capacity

Abstract

As reasoning models scale rapidly, the essential role of multimodality in human cognition has come into sharp relief, driving a growing need to probe vision-centric cognitive behaviors. Yet existing multimodal benchmarks either overemphasize textual reasoning or fall short of systematically capturing vision-centric cognitive behaviors, leaving the cognitive capacity of MLLMs insufficiently assessed. To address this limitation, we introduce MME-CC (Multi-Modal Evaluation benchmark of Cognitive Capacity), a vision-grounded benchmark that organizes 11 representative reasoning tasks into three fundamental categories of visual information: spatial, geometric, and knowledge-based reasoning. It also provides fine-grained analyses of MLLMs' cognitive capacity across these dimensions. Based on MME-CC, we conduct extensive experiments over 16 representative MLLMs. Our study reveals that closed-source models currently lead overall (e.g., 42.66 for Gemini-2.5-Pro vs. 30.45 for GLM-4.5V), while spatial and geometric reasoning remain broadly weak (≤ 30%). We further identify common error patterns, including orientation mistakes, fragile cross-view identity persistence, and poor adherence to counterfactual instructions. We also observe that Chain-of-Thought typically follows a three-stage process (extract → reason → verify) with heavy reliance on visual extraction. We hope this work catalyzes a shift toward treating the cognitive capacity of MLLMs as central to both evaluation and model design.
