Mind's Eye: A Benchmark of Visual Abstraction, Transformation and Composition for Multimodal LLMs
April 17, 2026
Authors: Rohit Sinha, Aditya Kanade, Sai Srinivas Kancheti, Vineeth N Balasubramanian, Tanuja Ganu
cs.AI
Abstract
Multimodal large language models (MLLMs) have achieved impressive progress on vision-language benchmarks, yet their capacity for visual cognition and visuospatial reasoning remains less understood. We introduce "Mind's Eye", a multiple-choice benchmark of eight visuo-cognitive tasks inspired by classic human intelligence tests and organized under a novel "A-R-T" taxonomy: Abstraction, Relation, and Transformation. The tasks probe core processes of fluid intelligence such as pattern induction, analogical relation mapping, and mental transformation. We evaluate a diverse suite of closed-source and open-source MLLMs and compare their performance with human participants. Humans achieve 80% accuracy, while the top-performing MLLMs remain below 50%. Error analysis reveals three failure modes: (i) poor allocation of visual attention, (ii) inability to perform internal perceptual manipulation, and (iii) weak abstraction of underlying visual concepts. Our findings suggest that current MLLMs exhibit limited visuospatial reasoning capabilities compared with human participants, highlighting the need for more cognitively grounded evaluation frameworks.
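As a rough illustration of the evaluation protocol described in the abstract (multiple-choice items scored for accuracy, overall and by task), the sketch below shows one way such scoring could be implemented. The `BenchmarkItem` schema, the `predict` callable, and the per-task breakdown are assumptions made here for illustration; the paper's actual harness is not specified in this abstract.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable

@dataclass
class BenchmarkItem:
    """One multiple-choice item: an image, candidate answers, and a gold label.
    (Hypothetical schema; the benchmark's actual format may differ.)"""
    task: str           # e.g. one of the eight visuo-cognitive tasks
    image_path: str     # path to the stimulus image
    choices: list[str]  # candidate answer labels, e.g. ["A", "B", "C", "D"]
    answer: str         # gold choice label, e.g. "B"

def evaluate(items: list[BenchmarkItem],
             predict: Callable[[BenchmarkItem], str]) -> dict[str, float]:
    """Score a model's predictions, returning per-task and overall accuracy."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for item in items:
        total[item.task] += 1
        if predict(item) == item.answer:  # exact match on the choice label
            correct[item.task] += 1
    accuracy = {task: correct[task] / total[task] for task in total}
    accuracy["overall"] = sum(correct.values()) / max(sum(total.values()), 1)
    return accuracy
```

In this sketch, `predict` would wrap a call to the MLLM under test (prompting it with the image and choices and parsing out a choice label), so the same loop can compare closed-source and open-source models against the human baseline.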