MME-CC: A Challenging Multi-Modal Evaluation Benchmark of Cognitive Capacity

Abstract

As reasoning models scale rapidly, the essential role of multimodality in human cognition has come into sharp relief, driving a growing need to probe vision-centric cognitive behaviors. Yet existing multimodal benchmarks either overemphasize textual reasoning or fail to capture vision-centric cognitive behaviors systematically, leaving the cognitive capacity of MLLMs insufficiently assessed. To address this limitation, we introduce MME-CC (Multi-Modal Evaluation benchmark of Cognitive Capacity), a vision-grounded benchmark that organizes 11 representative reasoning tasks into three fundamental categories of visual information (spatial, geometric, and knowledge-based reasoning) and provides fine-grained analyses of MLLMs' cognitive capacity across these dimensions. Based on MME-CC, we conduct extensive experiments on 16 representative MLLMs. Our study reveals that closed-source models currently lead overall (e.g., 42.66 for Gemini-2.5-Pro vs. 30.45 for GLM-4.5V), while spatial and geometric reasoning remain broadly weak (at or below 30%). We further identify common error patterns, including orientation mistakes, fragile cross-view identity persistence, and poor adherence to counterfactual instructions, and observe that Chain-of-Thought typically follows a three-stage process (extract -> reason -> verify) that relies heavily on visual extraction. We hope this work catalyzes a shift toward treating the cognitive capacity of MLLMs as central to both evaluation and model design.
