

MME-CC: A Challenging Multi-Modal Evaluation Benchmark of Cognitive Capacity

November 5, 2025
Authors: Kaiyuan Zhang, Chenghao Yang, Zhoufutu Wen, Sihang Yuan, Qiuyue Wang, Chaoyi Huang, Guosheng Zhu, He Wang, Huawenyu Lu, Jianing Wen, Jianpeng Jiao, Lishu Luo, Longxiang Liu, Sijin Wu, Xiaolei Zhu, Xuanliang Zhang, Ge Zhang, Yi Lin, Guang Shi, Chaoyou Fu, Wenhao Huang
cs.AI

Abstract

As reasoning models scale rapidly, the essential role of multimodality in human cognition has come into sharp relief, driving a growing need to probe vision-centric cognitive behaviors. Yet, existing multimodal benchmarks either overemphasize textual reasoning or fall short of systematically capturing vision-centric cognitive behaviors, leaving the cognitive capacity of MLLMs insufficiently assessed. To address this limitation, we introduce MME-CC (Multi-Modal Evaluation benchmark of Cognitive Capacity), a vision-grounded benchmark that organizes 11 representative reasoning tasks into three fundamental categories of visual information: spatial, geometric, and knowledge-based reasoning, and provides fine-grained analyses of MLLMs' cognitive capacity across these dimensions. Based on MME-CC, we conduct extensive experiments over 16 representative MLLMs. Our study reveals that closed-source models currently lead overall (e.g., 42.66 for Gemini-2.5-Pro vs. 30.45 for GLM-4.5V), while spatial and geometric reasoning remain broadly weak (less than or equal to 30%). We further identify common error patterns, including orientation mistakes, fragile cross-view identity persistence, and poor adherence to counterfactual instructions, and observe that Chain-of-Thought typically follows a three-stage process (extract -> reason -> verify) with heavy reliance on visual extraction. We hope this work catalyzes a shift toward treating the cognitive capacity of MLLMs as central to both evaluation and model design.
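To make the per-category reporting described above concrete, the following is a minimal sketch of how exact-match accuracy over the three MME-CC categories (spatial, geometric, and knowledge-based reasoning) could be aggregated. The item schema, field names, task names, and example records are illustrative assumptions; the abstract does not specify the benchmark's actual data format or scoring protocol.

```python
# Minimal sketch of per-category score aggregation for a benchmark like MME-CC.
# The record fields ("category", "task", "prediction", "answer") and the sample
# items below are hypothetical -- only the three category names come from the paper.
from collections import defaultdict

# Hypothetical model outputs paired with gold answers, one record per benchmark item.
records = [
    {"category": "spatial",   "task": "orientation",          "prediction": "left",  "answer": "right"},
    {"category": "spatial",   "task": "cross_view_identity",  "prediction": "obj_2", "answer": "obj_2"},
    {"category": "geometric", "task": "angle_estimation",     "prediction": "45",    "answer": "60"},
    {"category": "knowledge", "task": "counterfactual",       "prediction": "B",     "answer": "B"},
]

def per_category_accuracy(records):
    """Exact-match accuracy per reasoning category (spatial / geometric / knowledge)."""
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r["category"]] += 1
        correct[r["category"]] += int(r["prediction"].strip() == r["answer"].strip())
    return {cat: correct[cat] / total[cat] for cat in total}

if __name__ == "__main__":
    for cat, acc in per_category_accuracy(records).items():
        print(f"{cat}: {acc:.2%}")
```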