

Mind's Eye: A Benchmark of Visual Abstraction, Transformation and Composition for Multimodal LLMs

April 17, 2026
作者: Rohit Sinha, Aditya Kanade, Sai Srinivas Kancheti, Vineeth N Balasubramanian, Tanuja Ganu
cs.AI

Abstract

Multimodal large language models (MLLMs) have achieved impressive progress on vision-language benchmarks, yet their capacity for visual-cognitive and visuospatial reasoning remains underexplored. We introduce "Mind's Eye", a multiple-choice benchmark of eight visuo-cognitive tasks inspired by classic human intelligence tests and organized under a novel "A-R-T" taxonomy: Abstraction, Relation, and Transformation. The tasks probe core processes of fluid intelligence such as pattern induction, analogical relation mapping, and mental transformation. We evaluate a diverse suite of closed-source and open-source MLLMs and compare their performance with human participants. Humans achieve 80% accuracy, while the top-performing MLLMs remain below 50%. Error analysis reveals three recurring failure modes: (i) insufficient visual attention allocation, (ii) inability to perform internal perceptual manipulation, and (iii) weak abstraction of underlying visual concepts. Our findings suggest that current MLLMs exhibit limited visuospatial reasoning capabilities compared with human participants, highlighting the need for more cognitively grounded evaluation frameworks.
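
Evaluation on a multiple-choice benchmark of this kind reduces to exact-match accuracy over selected options, optionally broken down by taxonomy category. The sketch below illustrates that scoring scheme; the field names (category, answer, prediction) and the record format are illustrative assumptions, not the paper's released evaluation code or data schema.

```python
# Hypothetical sketch: per-category accuracy for a multiple-choice
# benchmark organized under the A-R-T taxonomy. Record fields are
# assumed for illustration, not taken from the paper's dataset.
from collections import defaultdict

ART_CATEGORIES = ("Abstraction", "Relation", "Transformation")

def score_by_category(records):
    """Return accuracy per A-R-T category from records that each hold
    the gold answer and a model's predicted option letter."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        cat = r["category"]  # one of ART_CATEGORIES
        total[cat] += 1
        correct[cat] += int(r["prediction"] == r["answer"])
    # Skip categories with no items to avoid division by zero.
    return {c: correct[c] / total[c] for c in ART_CATEGORIES if total[c]}

# Toy usage; a real run would load the benchmark items and each
# model's selected options.
records = [
    {"category": "Abstraction", "answer": "B", "prediction": "B"},
    {"category": "Transformation", "answer": "D", "prediction": "A"},
]
print(score_by_category(records))  # {'Abstraction': 1.0, 'Transformation': 0.0}
```

Grouping scores by taxonomy category, rather than reporting a single aggregate, is what lets an error analysis attribute failures to abstraction, relation mapping, or transformation separately.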