

MME-CC: A Challenging Multi-Modal Evaluation Benchmark of Cognitive Capacity

November 5, 2025
作者: Kaiyuan Zhang, Chenghao Yang, Zhoufutu Wen, Sihang Yuan, Qiuyue Wang, Chaoyi Huang, Guosheng Zhu, He Wang, Huawenyu Lu, Jianing Wen, Jianpeng Jiao, Lishu Luo, Longxiang Liu, Sijin Wu, Xiaolei Zhu, Xuanliang Zhang, Ge Zhang, Yi Lin, Guang Shi, Chaoyou Fu, Wenhao Huang
cs.AI

Abstract

As reasoning models scale rapidly, the essential role of multimodality in human cognition has come into sharp relief, driving a growing need to probe vision-centric cognitive behaviors. Yet existing multimodal benchmarks either overemphasize textual reasoning or fall short of systematically capturing vision-centric cognitive behaviors, leaving the cognitive capacity of MLLMs insufficiently assessed. To address this limitation, we introduce MME-CC (Multi-Modal Evaluation benchmark of Cognitive Capacity), a vision-grounded benchmark that organizes 11 representative reasoning tasks into three fundamental categories of visual information: spatial, geometric, and knowledge-based reasoning, and provides fine-grained analyses of MLLMs' cognitive capacity across these dimensions. Based on MME-CC, we conduct extensive experiments over 16 representative MLLMs. Our study reveals that closed-source models currently lead overall (e.g., 42.66 for Gemini-2.5-Pro vs. 30.45 for GLM-4.5V), while spatial and geometric reasoning remain broadly weak (≤ 30%). We further identify common error patterns, including orientation mistakes, fragile cross-view identity persistence, and poor adherence to counterfactual instructions, and observe that Chain-of-Thought typically follows a three-stage process (extract → reason → verify) with heavy reliance on visual extraction. We hope this work catalyzes a shift toward treating the cognitive capacity of MLLMs as central to both evaluation and model design.
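The abstract reports per-category accuracies (e.g., 42.66 overall for Gemini-2.5-Pro, ≤ 30% on spatial and geometric reasoning). As a minimal sketch of how such category-level scores could be aggregated, assuming a hypothetical per-item results format (task name, category, correctness flag); the task names below are invented placeholders, and the paper does not specify here whether scores are macro- or micro-averaged, so this is illustrative only and not the authors' released evaluation code:

```python
from collections import defaultdict

# Hypothetical per-item results: (task, category, is_correct).
# MME-CC groups its 11 tasks into spatial, geometric, and knowledge-based
# reasoning; the specific task names here are placeholders for illustration.
results = [
    ("maze_navigation", "spatial", True),
    ("maze_navigation", "spatial", False),
    ("shape_folding", "geometric", False),
    ("entity_lookup", "knowledge", True),
]

def category_scores(results):
    """Compute accuracy (%) per category plus a micro-averaged overall score."""
    correct, total = defaultdict(int), defaultdict(int)
    for task, category, is_correct in results:
        total[category] += 1
        correct[category] += int(is_correct)
    per_category = {c: 100.0 * correct[c] / total[c] for c in total}
    overall = 100.0 * sum(correct.values()) / sum(total.values())
    return per_category, overall

per_category, overall = category_scores(results)
print(per_category, overall)
```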